Abstract

Supervised manifold learning methods learn data representations by preserving the geometric
structure of data while enhancing the separation between data samples from different
classes. In this work, we propose a theoretical study of supervised manifold learning for
classification. We consider nonlinear dimensionality reduction algorithms that yield linearly
separable embeddings of training data and present generalization bounds for this type of
algorithm. A necessary condition for satisfactory generalization performance is that the
embedding allow the construction of a sufficiently regular interpolation function in relation
to the separation margin of the embedding. We show that for supervised embeddings
satisfying this condition, the classification error decays at an exponential rate with the
number of training samples. Finally, we examine the separability of supervised nonlinear
embeddings that aim to preserve the low-dimensional geometric structure of data based on
graph representations. The proposed analysis is supported by experiments on several real
data sets.

Keywords: Manifold learning, dimensionality reduction, classification, out-of-sample
extensions, RBF interpolation

1. Introduction 


In many data analysis problems, data samples have an intrinsically low-dimensional structure
although they reside in a high-dimensional ambient space. The learning of low-dimensional
structures in collections of data has been a well-studied topic over the last two
decades (Tenenbaum et al., 2000), (Roweis and Saul, 2000), (Belkin and Niyogi, 2003),
(He and Niyogi, 2004), (Donoho and Grimes, 2003), (Zhang and Zha, 2005). Following these
works, many classification methods have been proposed in recent years that apply such
manifold learning techniques to learn classifiers adapted to the geometric structure
of low-dimensional data (Hua et al., 2012), (Yang et al., 2011), (Zhang et al., 2012),
(Sugiyama, 2007), (Raducanu and Dornaika, 2012). The common approach in such works
is to learn a data representation that enhances the between-class separation while preserving
the intrinsic low-dimensional structure of data. While many efforts have focused on
the practical aspects of learning such supervised embeddings for training data, the
generalization performance of these methods as supervised classification algorithms has
received little attention so far. In this work, we study nonlinear supervised dimensionality
reduction methods and present performance bounds based on the properties of the
embedding and the interpolation function used for generalizing the embedding.

*. Most of the work was performed while the first author was at INRIA.


Several supervised manifold learning methods extend the Laplacian eigenmaps algorithm
(Belkin and Niyogi, 2003), or its linear variant LPP (He and Niyogi, 2004), to the
classification problem. The algorithms proposed by Hua et al. (2012), Yang et al. (2011),
Zhang et al. (2012) provide a supervised extension of the LPP algorithm and learn a linear
projection that preserves the proximity of neighboring samples from the same class, while
increasing the distance between nearby samples from different classes. The method by
Sugiyama (2007) proposes an adaptation of the Fisher metric for linear manifold learning,
which is in fact shown to be equivalent to the above methods by Yang et al. (2011), Zhang
et al. (2012). In (Li et al., 2013), (Cui and Fan, 2012), (Wang and Chen, 2009), some other
similar Fisher-based linear manifold learning methods are proposed. In (Raducanu and
Dornaika, 2012), a method relying on a formulation similar to that of (Hua et al., 2012),
(Yang et al., 2011), (Zhang et al., 2012) is presented, which, however, learns a nonlinear
embedding. The main advantage of linear dimensionality reduction methods over nonlinear ones
is that the generalization of the learnt embedding to novel (initially unavailable) samples
is straightforward. However, nonlinear manifold learning algorithms are more flexible, as
the possible data representations they can learn belong to a wider family of functions; e.g.,
one can always find a nonlinear embedding that makes training samples from different classes
linearly separable. On the other hand, when a nonlinear embedding is used, one must also
determine a suitable interpolation function to generalize the embedding to new samples,
and the choice of the interpolator is critical for the classification performance.


The common effort in all of these supervised dimensionality reduction methods is to
learn an embedding that increases the separation between different classes, while preserving
the geometric structure of data. It is interesting to note that supervised manifold learning
methods achieve separability by reducing the dimension of data, while kernel methods in
traditional classifiers achieve this by increasing the dimension of data. Meanwhile, making
training data linearly separable in supervised manifold learning does not mean much
by itself. Assuming that the data are sampled from a continuous distribution (hence two
samples coincide with probability 0), it is almost always possible to separate a discrete
set of samples from different classes with a nonlinear embedding, e.g., even with a simple
embedding such as the one mapping each sample to a vector encoding its class label. What
actually matters is how the embedding generalizes to test data, i.e., where the test samples
are mapped in the low-dimensional domain of embedding and how well they are then
classified. For kernel methods, the generalization to test data is straightforward, as it is
determined by the underlying main algorithm. However, in nonlinear supervised manifold
learning, this question has largely been overlooked so far. In this work we aim to fill this
gap and look into the generalization capabilities of supervised manifold learning algorithms.
We study the conditions that must be satisfied by the embedding of the training samples
and the interpolation function for satisfactory generalization of the classifier. We then
examine the rates of convergence of supervised manifold learning algorithms that satisfy these
conditions.

In Section 2, we consider arbitrary supervised manifold learning algorithms that compute
a linearly separable embedding of training samples. We study the generalization capability
of such algorithms for two types of out-of-sample interpolation functions. We first consider
arbitrary interpolation functions that are Lipschitz-continuous on the support of each class,
and then focus on out-of-sample extensions with radial basis function (RBF) kernels, which
are a popular family of interpolation functions. For both types of interpolators, we derive
conditions that must be satisfied by the embedding of the training samples and the
regularity of the interpolation function that generalizes the embedding to test samples, when
a nearest neighbor or linear classifier is used in the low-dimensional domain of embedding.
These conditions require the Lipschitz constant of the interpolator to be sufficiently small
in comparison with the separation margin between training samples from different classes
in the low-dimensional domain of embedding. The practical value of these results resides
in their implications about what must really be taken into account when designing a
supervised dimensionality reduction algorithm: achieving a good separation margin does not
suffice by itself; the geometric structure must also be preserved so as to ensure that a
sufficiently regular interpolator can be found to generalize the embedding to the whole ambient
space. We then particularly consider Gaussian RBF kernels and show the existence of an
optimal value for the kernel scale by studying the condition in our main result that links
the separation with the Lipschitz constant of the kernel.


Our results in Section 2 also provide bounds on the rate of convergence of the classification
error of supervised embeddings. We show that the misclassification error probability
decays at an exponential rate with the number of samples, provided that the interpolation
function is sufficiently regular with respect to the separation margin of the embedding.
These convergence rates are faster than those reported in previous results on RBF networks
(Niyogi and Girosi, 1996), (Lin et al., 2014), (Hernandez-Aguirre et al., 2002), and
regularized least-squares regression algorithms (Caponnetto and De Vito, 2007), (Steinwart
et al., 2009). The essential difference between our results and such previous works is that
those assume a general setting and do not focus on a particular data model, whereas our
results are relevant to settings where the support of each class admits a certain
structure, so as to allow the existence of an interpolator that is sufficiently regular on the
support of each class. Moreover, in contrast with these previous works, our bounds are
independent of the ambient space dimension and vary only with the intrinsic dimensions of
the class supports, as they characterize the error in terms of the covering numbers of the
supports.


The results in Section 2 assume an embedding that makes training samples from different
classes linearly separable. Even if most nonlinear dimensionality reduction methods are
observed to yield separable embeddings in practice, we aim to verify this theoretically in
Section 3. In particular, we focus on the nonlinear version of the supervised Laplacian
eigenmaps embeddings (Raducanu and Dornaika, 2012), (Hua et al., 2012), (Yang et al., 2011),
(Zhang et al., 2012). Supervised Laplacian eigenmaps methods embed the data with
the eigenvectors of a linear combination of two graph Laplacian matrices that encode the
links between neighboring samples from the same class and from different classes. In such a
data representation, the coordinates of neighboring data samples change slowly within the same
class and rapidly across different classes. We study the conditions for the linear separability
of these embeddings and characterize their separation margin in terms of some graph and
algorithm parameters.

In Section 4, we evaluate our results with experiments on several object and face data
sets. We study the implications of the condition derived in Section 2 on the tradeoff
between the separability margin and the interpolator regularity. The experimental comparison
of several supervised dimensionality reduction algorithms shows that this compromise between
the separation and the interpolator regularity can indeed be related to the practical
classification performance of a supervised manifold learning algorithm. This suggests that
one can possibly improve the accuracy of supervised dimensionality reduction algorithms by
considering more carefully the generalization capability of the embedding during learning.
We then study the variation of the classification performance with parameters such as the
sample size, the RBF kernel scale, and the dimension of the embedding, in view of the
generalization bounds presented in Section 2. Finally, we conclude in Section 5.

2. Performance bounds for supervised manifold learning methods 
2.1 Notation and Problem Formulation 

Consider a setting with $M$ data classes where the samples of each class $m \in \{1, \dots, M\}$ are
drawn from a probability measure $\nu_m$ in a Hilbert space $H$ such that $\nu_m$ has a bounded
support $\mathcal{M}_m \subset H$. Let $X = \{x_i\}_{i=1}^N \subset H$ be a set of $N$ training samples such that
each $x_i$ is drawn from one of the probability measures $\nu_m$, and the samples drawn from
each $\nu_m$ are independent and identically distributed. We denote the class label of $x_i$ by
$C_i \in \{1, 2, \dots, M\}$.

Let $Y = \{y_i\}_{i=1}^N \subset \mathbb{R}^d$ be a $d$-dimensional embedding of $X$, where each $y_i$ corresponds to
$x_i$. We consider supervised embeddings such that $Y$ is linearly separable. Linear separability
is defined as follows:

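Definition 1 The data representation $Y$ is linearly separable with a margin of $\gamma > 0$, if
for any two classes $k, l \in \{1, 2, \dots, M\}$, there exists a separating hyperplane defined by
$\omega_{kl} \in \mathbb{R}^d$, $\|\omega_{kl}\| = 1$ and $b_{kl} \in \mathbb{R}$ such that

$$\omega_{kl}^T y_i + b_{kl} \geq \gamma/2 \quad \text{if } C_i = k$$
$$\omega_{kl}^T y_i + b_{kl} \leq -\gamma/2 \quad \text{if } C_i = l. \qquad (1)$$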

The above definition of separability implies the following. For any given class $m$,
there exists a set of hyperplanes $\{\omega_{mk}\}_{k \neq m} \subset \mathbb{R}^d$, $\|\omega_{mk}\| = 1$, and a set of real
numbers $\{b_{mk}\}_{k \neq m} \subset \mathbb{R}$ that separate class $m$ from the other classes, such that for all
$y_i$ of class $C_i = m$

$$\omega_{mk}^T y_i + b_{mk} \geq \gamma/2, \quad \forall k \neq m \qquad (2)$$

and for all $y_i$ of class $C_i \neq m$, there exists a $k$ such that

$$\omega_{mk}^T y_i + b_{mk} \leq -\gamma/2. \qquad (3)$$

These hyperplanes are obtained by setting $\omega_{km} = -\omega_{mk}$, $b_{km} = -b_{mk}$.
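
As an illustration of Definition 1, the following minimal sketch computes the largest margin with which a given hyperplane separates two classes of an embedding. The function name `pairwise_margin` and the toy data are hypothetical and not part of the original text; the sketch only checks a candidate hyperplane and does not search for the best one.

```python
import numpy as np

def pairwise_margin(Y, labels, k, l, omega, b):
    """Largest gamma for which (omega, b) separates classes k and l of the
    embedding Y in the sense of Definition 1. A non-positive return value
    means that the hyperplane does not separate the two classes."""
    omega = np.asarray(omega, dtype=float)
    norm = np.linalg.norm(omega)
    omega, b = omega / norm, b / norm      # Definition 1 assumes a unit normal
    scores = Y @ omega + b
    pos = scores[labels == k]              # must satisfy score >= gamma / 2
    neg = scores[labels == l]              # must satisfy score <= -gamma / 2
    return 2.0 * min(pos.min(), (-neg).min())

# Toy embedding in R^2 with two classes (illustrative values only).
Y = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
labels = np.array([1, 1, 2, 2])
gamma = pairwise_margin(Y, labels, k=1, l=2, omega=[1.0, 0.0], b=0.0)
```

In this toy case the closest points of both classes lie at distance 2 from the hyperplane, so the returned margin is 4.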

Figure 1 shows an illustration of a linearly separable embedding of data samples from
two classes. Manifold learning methods typically compute a low-dimensional embedding
$Y$ of training data $X$ in a pointwise manner, i.e., the coordinates $y_i$ are computed only
for the initially available training samples $x_i$. However, in a classification problem, in
order to estimate the class label of a new data sample $x$ of unknown class, $x$ needs to be
mapped to the low-dimensional domain of embedding as well. The construction of a function
$f : H \to \mathbb{R}^d$ that generalizes the learnt embedding to the whole space is known as the
out-of-sample generalization problem. Smooth functions are commonly used for out-of-sample
interpolation, e.g., as in (Qiao et al., 2013), (Peherstorfer et al., 2011).

Figure 1: Illustration of a linearly separable embedding. Data in $X$ are sampled from two
different classes with supports $\mathcal{M}_1$, $\mathcal{M}_2$. The samples $X$ are mapped to the
coordinates $Y$ with a low-dimensional embedding, where the two classes become
linearly separable with margin $\gamma$ with the hyperplane given by $\omega$, $b$.


Now let $x$ be a test sample drawn from the probability measure $\nu_m$; hence, the true class
label of $x$ is $m$. In our study, we consider two basic classification schemes in the domain of
embedding:

Linear classifier. The embeddings of the training samples are used to compute the
separating hyperplanes, i.e., the classifier parameters $\{\omega_{mk}\}$ and $\{b_{mk}\}$. Then, mapping $x$
to the low-dimensional domain as $f(x) \in \mathbb{R}^d$, the class label of $x$ is estimated as $C(x) = l$
if there exists $l \in \{1, \dots, M\}$ such that

$$\omega_{lk}^T f(x) + b_{lk} > 0, \quad \forall k \in \{1, \dots, M\} \setminus \{l\}. \qquad (4)$$

Note that the existence of such an $l$ is not guaranteed in general for any $x$, but for a given
$x$ there cannot be more than one $l$ satisfying the above condition. Then $x$ is classified
correctly if the estimated class label agrees with the true class label, i.e., $C(x) = l = m$.

Nearest neighbor classification. The test sample $x$ is assigned the class label of the
closest training point in the domain of embedding, i.e., $C(x) = C_{i'}$, where

$$i' = \arg\min_i \|y_i - f(x)\|.$$
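
The two classification schemes above can be sketched as follows. This is a minimal illustration under stated assumptions: classes are indexed from 0 rather than 1, the hyperplanes are stored in dictionaries keyed by the class pair `(l, k)`, and the names `linear_classify` and `nn_classify` are hypothetical.

```python
import numpy as np

def linear_classify(fx, omegas, biases, num_classes):
    """Return the label l with omega_lk^T f(x) + b_lk > 0 for all k != l,
    as in (4), or None when no class satisfies the condition."""
    for l in range(num_classes):
        if all(omegas[(l, k)] @ fx + biases[(l, k)] > 0
               for k in range(num_classes) if k != l):
            return l
    return None

def nn_classify(fx, Y, labels):
    """Return the label of the training embedding y_i closest to f(x)."""
    nearest = np.argmin(np.linalg.norm(Y - fx, axis=1))
    return labels[nearest]

# Two classes separated by the vertical axis (illustrative values only).
omegas = {(0, 1): np.array([1.0, 0.0]), (1, 0): np.array([-1.0, 0.0])}
biases = {(0, 1): 0.0, (1, 0): 0.0}
Y = np.array([[1.0, 0.0], [-1.0, 0.0]])
labels = np.array([0, 1])
fx = np.array([0.5, 0.0])
```

A point lying exactly on the hyperplane, such as the origin here, satisfies (4) for neither class, which is the case where the linear classifier returns no label.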


In the rest of this section, we study the generalization performance of supervised
dimensionality reduction methods. We first consider in Section 2.2 interpolation functions that
vary regularly on each class support, and we derive a lower bound on the probability of
correctly classifying a new data sample in terms of the regularity of $f$, the separation of
the embedding, and the sampling density. Then in Section 2.3 we study the classification
performance for a particular type of interpolation function, namely RBF interpolators,
which are among the most popular ones (Peherstorfer et al., 2011), (Chin and Suter, 2008).
We focus particularly on Gaussian RBF interpolators in Section 2.4 and derive some results
regarding the existence of an optimal kernel scale parameter. Lastly, we discuss our results
in comparison with previous literature in Section 2.5.

In the results in Sections 2.2-2.4, we keep a generic formulation and simply treat the
supports $\{\mathcal{M}_m\}$ as arbitrary bounded subsets of $H$, each of which represents a different data
class. Nevertheless, from the perspective of manifold learning, our results are of interest
especially when the data is assumed to have an underlying low-dimensional structure. In
Section 2.5, we study the implications of our results for the setting where $\mathcal{M}_m$ are
low-dimensional manifolds. We then examine how the proposed bounds vary in relation to the
intrinsic dimensions of $\{\mathcal{M}_m\}$.


2.2 Out-of-sample interpolation with regular functions 


Let $f : H \to \mathbb{R}^d$ be an out-of-sample interpolation function such that $f(x_i) = y_i$ for each
training sample $x_i$, $i = 1, \dots, N$. Assume that $f$ is Lipschitz continuous with constant
$L > 0$ when restricted to any one of the supports $\mathcal{M}_m$; i.e., for any $m \in \{1, \dots, M\}$ and
any $u, v \in \mathcal{M}_m$

$$\|f(u) - f(v)\| \leq L \|u - v\|$$

where $\|\cdot\|$ denotes the $\ell_2$-norm if the argument is in $\mathbb{R}^d$, and the norm induced from
the inner product in $H$ if the argument is in $H$.

We will find a relation between the classification accuracy and the number of training
samples via the covering number of the supports $\mathcal{M}_m$. Let $B_\epsilon(x) \subset H$ denote an open ball
of radius $\epsilon$ around $x$

$$B_\epsilon(x) = \{u \in H : \|x - u\| < \epsilon\}.$$

The covering number $\mathcal{N}(\epsilon, A)$ of a set $A \subset H$ is defined as the smallest number of open
balls $B_\epsilon$ of radius $\epsilon$ whose union contains $A$ (Kulkarni and Posner, 1995)

$$\mathcal{N}(\epsilon, A) = \inf\{k : \exists\, u_1, \dots, u_k \in H \ \text{s.t.} \ A \subset \bigcup_{i=1}^{k} B_\epsilon(u_i)\}.$$

We assume that the supports $\mathcal{M}_m$ are totally bounded, i.e., $\mathcal{M}_m$ has a finite covering
number for any $\epsilon > 0$.
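
The covering number of a class support is rarely available in closed form. As a rough illustration, the sketch below greedily covers a finite point set with open balls centered at sample points. The function name `greedy_cover_size` is hypothetical, and the returned count is only an upper bound on the covering number of the finite sample, not the covering number of the underlying support.

```python
import numpy as np

def greedy_cover_size(points, eps):
    """Number of open eps-balls used by a greedy cover of a finite point set.
    Centers are restricted to sample points, so the result upper-bounds the
    covering number N(eps, points) defined in the text."""
    points = np.asarray(points, dtype=float)
    uncovered = np.ones(len(points), dtype=bool)
    num_centers = 0
    while uncovered.any():
        center = points[np.argmax(uncovered)]        # first uncovered point
        dists = np.linalg.norm(points - center, axis=1)
        uncovered &= dists >= eps                    # open ball: dist < eps is covered
        num_centers += 1
    return num_centers

# Ten equally spaced points on a line (illustrative values only).
line = np.arange(10, dtype=float).reshape(-1, 1)
n_cover = greedy_cover_size(line, eps=1.5)
```

Greedy covers are not optimal in general, so the count should be read as a conservative estimate when it is plugged into sample-size conditions.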

We state below a lower bound for the probability of correctly classifying a sample $x$
drawn from $\nu_m$, in terms of the number of training samples drawn from $\nu_m$, the separation
of the embedding and the regularity of $f$.

Theorem 2 For some $\epsilon$ with $0 < \epsilon \leq \gamma/(2L)$, let the training set $X$ contain at least $N_m$
samples drawn i.i.d. according to a probability measure $\nu_m$ such that

$$N_m > \mathcal{N}(\epsilon/2, \mathcal{M}_m).$$

Let $Y$ be an embedding of the training samples $X$ that is linearly separable with margin larger
than $\gamma$, and let $f$ be an interpolation function that is Lipschitz continuous with constant $L$
on the support $\mathcal{M}_m$. Then the probability of correctly classifying a test sample $x$ drawn from
$\nu_m$ independently from the training samples with the linear classifier (4) is lower bounded
as

$$P\big(C(x) = m\big) \geq 1 - \frac{\mathcal{N}(\epsilon/2, \mathcal{M}_m)}{2 N_m}.$$


The proof of the theorem is given in Appendix A.1. Theorem 2 establishes a link
between the classification performance and the separation of the embedding of the training
samples. In particular, due to the condition $\epsilon \leq \gamma/(2L)$, an increase in the separation $\gamma$
allows a larger value for $\epsilon$, provided that the interpolator regularity is not affected much.
This reduces the covering number $\mathcal{N}(\epsilon/2, \mathcal{M}_m)$ in return and increases the probability of
correct classification. Similarly, from the condition $\epsilon \leq \gamma/(2L)$, one can also observe that
at a given separation $\gamma$, a smaller Lipschitz constant $L$ for the interpolation function allows
the parameter $\epsilon$ to take a larger value. This reduces the covering number $\mathcal{N}(\epsilon/2, \mathcal{M}_m)$
and therefore increases the correct classification probability. Thus, choosing a more regular
interpolator at a given separation helps improve the classification performance. If the $\epsilon$
parameter is fixed, the Lipschitz constant of the interpolator is allowed to increase only
proportionally to the separation margin. The condition that the interpolator must be
sufficiently regular in comparison with the separation suggests that increasing the separation
too much at the cost of impairing the interpolator regularity may degrade the classifier
performance. In the case that the supports $\mathcal{M}_m$ are low-dimensional manifolds, the covering
number $\mathcal{N}(\epsilon/2, \mathcal{M}_m)$ increases at a geometric rate with the intrinsic dimension $D$ of the
manifold, since a $D$-dimensional manifold is locally homeomorphic to $\mathbb{R}^D$. Therefore, from
the condition on the number of samples, $N_m$ should increase at a geometric rate with $D$.
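
The quantities in Theorem 2 are simple to evaluate once a covering number is known or estimated. The sketch below, with hypothetical function names and illustrative numbers, computes the largest admissible ball radius for a given margin and Lipschitz constant, and the resulting lower bound on the correct classification probability.

```python
def theorem2_max_eps(gamma, lipschitz):
    """Largest radius allowed by the condition eps <= gamma / (2 L)."""
    return gamma / (2.0 * lipschitz)

def theorem2_lower_bound(covering_number, n_m):
    """Lower bound 1 - N(eps/2, M_m) / (2 N_m) on the probability of
    correct classification, valid when N_m exceeds the covering number."""
    if n_m <= covering_number:
        raise ValueError("Theorem 2 requires N_m > N(eps/2, M_m)")
    return 1.0 - covering_number / (2.0 * n_m)

# Illustrative numbers only: margin 1, Lipschitz constant 2, and an
# assumed covering number of 50 at radius eps / 2.
eps_max = theorem2_max_eps(gamma=1.0, lipschitz=2.0)
bound = theorem2_lower_bound(covering_number=50, n_m=1000)
```

Doubling the margin or halving the Lipschitz constant doubles the admissible radius, which is the mechanism by which the covering number, and hence the required sample size, decreases.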

In Theorem 2, the probability of misclassification decreases with the number $N_m$ of
training samples at a rate of $O(N_m^{-1})$. In the rest of this section, we show that it is in
fact possible to obtain an exponential convergence rate with linear and NN-classifiers under
certain assumptions. We first present the following lemma.

Lemma 3 Let $X = \{x_i\}_{i=1}^N \subset H$ be a set of training samples such that each $x_i$ is drawn
i.i.d. from one of the probability measures $\{\nu_m\}_{m=1}^M$. Let $x$ be a test sample randomly drawn
according to the probability measure $\nu_m$ of class $m$. Let

$$A = \{x_i \in X : x_i \in B_\delta(x),\ x_i \sim \nu_m\} \qquad (5)$$

be the set of samples in $X$ that are in a $\delta$-neighborhood of $x$ and also drawn from the measure
$\nu_m$. Assume that $A$ contains $|A| = Q$ samples. Then

$$P\left( \left\| f(x) - \frac{1}{|A|} \sum_{x_j \in A} f(x_j) \right\| \leq L\delta + \sqrt{d}\,\epsilon \right) \geq 1 - 2d \exp\left( -\frac{Q \epsilon^2}{2 L^2 \delta^2} \right). \qquad (6)$$


Lemma 3 is proved in Appendix A.2. The inequality in (6) shows that as the number $Q$
of training samples falling in a neighborhood of a test point $x$ increases, the probability of
the deviation of $f(x)$ from its average within the neighborhood decreases. The parameter
$\epsilon$ captures the relation between the amount and the probability of deviation.

When studying the classification accuracy in the main result below, we will use the
following generalized definition of the linear separation.


Vural and Guillemot 


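Definition 4 Let $Y$ be a linearly separable embedding with margin $\gamma$ such that each pair
$(k, l)$ of classes are separated with the hyperplanes given by $\omega_{kl}$, $b_{kl}$ as defined in Definition
1. We say that the linear classifier given by $\{\omega_{kl}\}$, $\{b_{kl}\}$ has a $Q$-mean separability margin of
$\gamma_Q > 0$ if any choice of $Q$ samples $\{y_{k,i}\}_{i=1}^Q \subset Y$ from class $k$ and $Q$ samples $\{y_{l,i}\}_{i=1}^Q \subset Y$
from class $l$, $l \neq k$, satisfies

$$\omega_{kl}^T \left( \frac{1}{Q} \sum_{i=1}^{Q} y_{k,i} \right) + b_{kl} \geq \gamma_Q/2$$
$$\omega_{kl}^T \left( \frac{1}{Q} \sum_{i=1}^{Q} y_{l,i} \right) + b_{kl} \leq -\gamma_Q/2. \qquad (7)$$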


The above definition of separability is more flexible than the one in Definition 1. Clearly,
an embedding that is linearly separable with margin $\gamma$ has a $Q$-mean separability margin
of $\gamma_Q \geq \gamma$ for any $Q$. As in the previous section, we consider that the test sample $x$ is
classified with the linear classifier (4) in the low-dimensional domain, defined with respect
to the set of hyperplanes given by $\{\omega_{mk}\}$ and $\{b_{mk}\}$ as in (2) and (3).

In the following result, we show that an exponential convergence rate can be obtained
with linear classifiers in supervised manifold learning. We define beforehand a parameter
depending on $\delta$, which gives the smallest possible measure of the $\delta$-neighborhood $B_\delta(x)$ of
a point $x$ in the support $\mathcal{M}_m$:

$$\eta_{m,\delta} := \inf_{x \in \mathcal{M}_m} \nu_m(B_\delta(x)).$$

Theorem 5 Let $X = \{x_i\}_{i=1}^N \subset H$ be a set of training samples such that each $x_i$ is drawn
i.i.d. from one of the probability measures $\{\nu_m\}_{m=1}^M$. Let $Y$ be an embedding of $X$ in $\mathbb{R}^d$
that is linearly separable with a $Q$-mean separability margin larger than $\gamma_Q$. For a given
$\epsilon > 0$ and $\delta > 0$, let $f$ be a Lipschitz-continuous interpolator such that

$$L\delta + \sqrt{d}\,\epsilon \leq \frac{\gamma_Q}{2}. \qquad (8)$$

Consider a test sample $x$ randomly drawn according to the probability measure $\nu_m$ of class
$m$. If $X$ contains at least $N_m$ training samples drawn i.i.d. from $\nu_m$ such that

$$N_m > \frac{Q}{\eta_{m,\delta}}$$

then the probability of correctly classifying $x$ with the linear classifier given in (4) is lower
bounded as

$$P\big(C(x) = m\big) \geq 1 - \exp\left( -\frac{2 (N_m \eta_{m,\delta} - Q)^2}{N_m} \right) - 2d \exp\left( -\frac{Q \epsilon^2}{2 L^2 \delta^2} \right). \qquad (9)$$

Theorem 5 is proved in Appendix A.3. The theorem shows how the classification
accuracy is influenced by the separation of the classes in the embedding, the smoothness of
the out-of-sample interpolant, and the number of training samples drawn from the density
of each class. The condition in (8) points to the tradeoff between the separation and the
regularity of the interpolation function. As the Lipschitz constant $L$ of the interpolation
function $f$ increases, $f$ becomes less "regular", and a higher separation $\gamma_Q$ is needed to
meet the condition. This is coherent with the expectation that, when $f$ becomes irregular,
the classifier becomes more sensitive to perturbations of the data, e.g., due to noise.
The requirement of a higher separation then ensures a larger margin in the linear
classifier, which compensates for the irregularity of $f$. From (8), it is also observed that the
separation should increase with the dimension $d$ as well, and also with $\epsilon$, whose increase
improves the confidence of the bound (9). Note that the condition in (8) also implies the
following: When computing an embedding, it is not advisable to increase the separation of
training data unconditionally. In particular, increasing the separation too much may violate
the preservation of the geometry and yield an irregular interpolator. Hence, when designing
a supervised dimensionality reduction algorithm, one must pay attention to the regularity
of the resulting interpolator as much as to the enhancement of the separation margin.

Next, we discuss the roles of the parameters $Q$ and $\delta$. The term $\exp(-Q\epsilon^2/(2L^2\delta^2))$ in
the correct classification probability bound (9) shows that, for fixed $\delta$, the confidence
increases with the value of $Q$. Meanwhile, due to the numerator of the term
$\exp(-2(N_m \eta_{m,\delta} - Q)^2/N_m)$, the number of samples $N_m$ should also be relatively large
with respect to $Q$ to have a high overall confidence. Similarly, at fixed $Q$, $\delta$ should be
made smaller to increase the confidence due to the term $\exp(-Q\epsilon^2/(2L^2\delta^2))$, which then
reduces the parameter $\eta_{m,\delta}$ and eventually requires the number of samples $N_m$ to take a
sufficiently large value in order to make the term $\exp(-2(N_m \eta_{m,\delta} - Q)^2/N_m)$ small and
have a high confidence. Therefore, these two parameters $Q$ and $\delta$ behave in a similar way,
and determine the relation between the number of samples and the correct classification
probability, i.e., they indicate how large $N_m$ should be in order to have a certain confidence
of correct classification.
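
To make the interplay between $N_m$, $Q$ and $\delta$ concrete, the following sketch evaluates the right-hand side of (9) after checking the condition (8) and the sample-size requirement $N_m > Q/\eta_{m,\delta}$. The function name and all numerical values are hypothetical; in practice $\eta_{m,\delta}$ depends on the unknown class distribution and would have to be assumed or estimated.

```python
import math

def theorem5_lower_bound(n_m, eta, Q, d, eps, delta, lipschitz, gamma_q):
    """Right-hand side of (9), after checking condition (8) and N_m > Q / eta."""
    if lipschitz * delta + math.sqrt(d) * eps > gamma_q / 2.0:
        raise ValueError("condition (8) is violated")
    if n_m <= Q / eta:
        raise ValueError("Theorem 5 requires N_m > Q / eta")
    sampling_term = math.exp(-2.0 * (n_m * eta - Q) ** 2 / n_m)
    deviation_term = 2.0 * d * math.exp(
        -Q * eps ** 2 / (2.0 * lipschitz ** 2 * delta ** 2))
    return 1.0 - sampling_term - deviation_term

# Illustrative values only; eta is an assumed neighborhood measure.
params = dict(eta=0.05, Q=20, d=2, eps=0.1, delta=0.1,
              lipschitz=1.0, gamma_q=0.5)
bound_small = theorem5_lower_bound(n_m=1000, **params)
bound_large = theorem5_lower_bound(n_m=4000, **params)
```

With these values the bound is dominated by the sampling term for the smaller sample size and approaches 1 for the larger one, which reflects the exponential dependence on $N_m$.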

Theorem 5 studies the setting where the class labels are estimated with a linear
classifier in the domain of embedding. We also provide another result below that analyses the
performance when a nearest-neighbor classifier is used in the domain of embedding.

Theorem 6 Let $X = \{x_i\}_{i=1}^N \subset H$ be a set of training samples such that each $x_i$ is drawn
i.i.d. from one of the probability measures $\{\nu_m\}_{m=1}^M$. Let $Y$ be an embedding of $X$ in $\mathbb{R}^d$
such that

$$\|y_i - y_j\| \leq D_\delta, \quad \text{if } \|x_i - x_j\| \leq \delta \text{ and } C_i = C_j$$
$$\|y_i - y_j\| > \gamma, \quad \text{if } C_i \neq C_j,$$

hence, nearby samples from the same class are mapped to nearby points, and samples from
different classes are separated by a distance of at least $\gamma$ in the embedding.

For given $\epsilon > 0$ and $\delta > 0$, let $f$ be a Lipschitz-continuous interpolation function such
that

$$L\delta + \sqrt{d}\,\epsilon + D_{2\delta} \leq \frac{\gamma}{2}. \qquad (10)$$

Consider a test sample $x$ randomly drawn according to the probability measure $\nu_m$ of
class $m$. If $X$ contains at least $N_m$ training samples drawn i.i.d. from $\nu_m$ such that

$$N_m > \frac{Q}{\eta_{m,\delta}}$$

then the probability of correctly classifying $x$ with nearest-neighbor classification in $\mathbb{R}^d$ is
lower bounded as

$$P\big(C(x) = m\big) \geq 1 - \exp\left( -\frac{2 (N_m \eta_{m,\delta} - Q)^2}{N_m} \right) - 2d \exp\left( -\frac{Q \epsilon^2}{2 L^2 \delta^2} \right). \qquad (11)$$

Theorem 6 is proved in Appendix A.4. Theorem 6 is quite similar to Theorem 5 and
can be interpreted similarly. Unlike in the previous result, the separability condition of the
embedding is based here on the pairwise distances of samples from different classes. The
condition (10) suggests that the result is useful when the parameter $D_{2\delta}$ is sufficiently small,
which requires the embedding to map nearby samples from the same class in the ambient
space to nearby points.

In this section, we have characterized the regularity of the interpolation functions via
their rates of variation when restricted to the supports $\mathcal{M}_m$. While the results of this
section are generic in the sense that they are valid for any interpolation function with the
described regularity properties, we have not examined the construction of such functions.
In a practical classification problem where one uses a particular type of interpolation
function, one would also be interested in the adaptation of these results to obtain performance
guarantees for the particular type of function used. Hence, in the following section we focus
on a popular family of smooth functions, radial basis function (RBF) interpolators, and
study the classification performance of this particular type of interpolator.


2.3 Out-of-sample interpolation with RBF interpolators 

Here we consider an RBF interpolation function $f : H \to \mathbb{R}^d$ of the form

$$f(x) = [f^1(x) \ f^2(x) \ \dots \ f^d(x)]$$

such that each component $f^k$ of $f$ is given by

$$f^k(x) = \sum_{i=1}^{N} c_i^k \, \phi(\|x - x_i\|)$$

where $\phi : \mathbb{R} \to \mathbb{R}^+$ is a kernel function, $c_i^k \in \mathbb{R}$ are coefficients, and $x_i$ are kernel centers.
In interpolation with RBF functions, it is common to choose the set of kernel centers as
the set of available data samples. Hence, we assume that the set of kernel centers $\{x_i\}_{i=1}^N$
is selected to be the same as the set of training samples $X$. We consider a setting where
the coefficients $c_i^k$ are set such that $f(x_i) = y_i$, i.e., $f$ maps each training point in $X$ to its
embedding previously computed with supervised manifold learning.
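
A minimal sketch of such an interpolator is given below. It assumes a Gaussian kernel $\phi(r) = \exp(-r^2/\sigma^2)$, which is one possible choice and is not fixed by the text at this point, and it obtains the coefficients by solving the linear system that enforces $f(x_i) = y_i$. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(r, sigma):
    """Assumed kernel phi(r) = exp(-r^2 / sigma^2)."""
    return np.exp(-(r ** 2) / sigma ** 2)

def fit_rbf(X, Y, sigma):
    """Coefficients c (shape N x d) such that f(x_i) = y_i for all training
    samples, with the training samples used as kernel centers."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    Phi = gaussian_kernel(dists, sigma)      # N x N kernel matrix
    return np.linalg.solve(Phi, Y)

def rbf_interpolate(x, X, coeffs, sigma):
    """Evaluate f(x) = sum_i c_i phi(||x - x_i||), a vector in R^d."""
    r = np.linalg.norm(X - x, axis=1)
    return gaussian_kernel(r, sigma) @ coeffs

# Random training data and embedding (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
Y = rng.normal(size=(10, 2))
coeffs = fit_rbf(X, Y, sigma=1.0)
fitted = np.array([rbf_interpolate(x, X, coeffs, 1.0) for x in X])
```

For distinct centers the Gaussian kernel matrix is positive definite, so the system has a unique solution, although it can become ill-conditioned when the kernel scale is large relative to the distances between samples.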

We consider the RBF kernel $\phi$ to be a Lipschitz continuous function with constant
$L_\phi > 0$; hence, for any $u, v \in \mathbb{R}$

$$|\phi(u) - \phi(v)| \leq L_\phi |u - v|.$$

Also, let $C$ be an upper bound on the coefficient magnitudes such that for all $k = 1, \dots, d$

$$\sum_{i=1}^{N} |c_i^k| \leq C.$$
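
The two regularity parameters can be computed directly for a fitted interpolator. The sketch below assumes the Gaussian kernel $\phi(r) = \exp(-r^2/\sigma^2)$, for which $|\phi'(r)| = (2r/\sigma^2)\exp(-r^2/\sigma^2)$ is maximized at $r = \sigma/\sqrt{2}$, giving $L_\phi = \sqrt{2}\,e^{-1/2}/\sigma$. Both the kernel choice and the function names are assumptions made for illustration.

```python
import numpy as np

def coefficient_bound(coeffs):
    """Smallest valid C: the largest l1-norm over the d coefficient columns,
    where coeffs has shape (N, d) with entry (i, k) equal to c_i^k."""
    return np.abs(coeffs).sum(axis=0).max()

def gaussian_lipschitz(sigma):
    """Lipschitz constant of phi(r) = exp(-r^2 / sigma^2), attained at
    r = sigma / sqrt(2)."""
    return np.sqrt(2.0) * np.exp(-0.5) / sigma

# Illustrative coefficients for N = 2 centers and d = 2 dimensions.
coeffs = np.array([[1.0, -2.0], [3.0, 0.5]])
C = coefficient_bound(coeffs)
L_phi = gaussian_lipschitz(sigma=1.3)
```

Under this kernel, a larger scale $\sigma$ lowers $L_\phi$, while the coefficient magnitudes, and hence $C$, typically depend on $\sigma$ as well, so the two parameters cannot be tuned independently.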

In the following, we analyze the classification accuracy and extend the results in Section
2.2 to the case of RBF interpolators. We first give the following result, which
probabilistically bounds how much the value of the interpolator $f$ at a point $x$ randomly drawn from
$\nu_m$ may deviate from the average interpolator value of the training points of the same class
within a neighborhood of $x$.

Lemma 7 Let $X = \{x_i\}_{i=1}^N \subset H$ be a set of training samples such that each $x_i$ is drawn
i.i.d. from one of the probability measures $\{\nu_m\}_{m=1}^M$. Let $x$ be a test sample randomly drawn
according to the probability measure $\nu_m$ of class $m$. Let

$$A = \{x_i \in X : x_i \in B_\delta(x),\ x_i \sim \nu_m\} \qquad (12)$$

be the set of samples in $X$ that are in a $\delta$-neighborhood of $x$ and also drawn from the measure
$\nu_m$. Assume that $A$ contains $|A| = Q$ samples. Then

$$P\left( \left\| f(x) - \frac{1}{|A|} \sum_{x_j \in A} f(x_j) \right\| \leq \sqrt{d}\, C (L_\phi \delta + \epsilon) \right) \geq 1 - 2N \exp\left( -\frac{(Q - 1) \epsilon^2}{2 L_\phi^2 \delta^2} \right). \qquad (13)$$


The proof of Lemma 7 is given in Appendix A.5. The lemma states a result similar to
the one in Lemma 3, but is specialized to the case where $f$ is an RBF interpolator.
We are now ready to present the following main result.

Theorem 8 Let $X = \{x_i\}_{i=1}^N \subset H$ be a set of training samples such that each $x_i$ is drawn
i.i.d. from one of the probability measures $\{\nu_m\}_{m=1}^M$. Let $Y$ be an embedding of $X$ in $\mathbb{R}^d$
that is linearly separable with a $Q$-mean separability margin larger than $\gamma_Q$. For a given
$\epsilon > 0$ and $\delta > 0$, let $f$ be an RBF interpolator such that

$$\sqrt{d}\, C (L_\phi \delta + \epsilon) \leq \frac{\gamma_Q}{2}. \qquad (14)$$

Consider a test sample $x$ randomly drawn according to the probability measure $\nu_m$ of class
$m$. If $X$ contains at least $N_m$ training samples drawn i.i.d. from $\nu_m$ such that

$$N_m > \frac{Q}{\eta_{m,\delta}}$$

then the probability of correctly classifying $x$ with the linear classifier given in (4) is lower
bounded as

$$P\big(C(x) = m\big) \geq 1 - \exp\left( -\frac{2 (N_m \eta_{m,\delta} - Q)^2}{N_m} \right) - 2N \exp\left( -\frac{(Q - 1) \epsilon^2}{2 L_\phi^2 \delta^2} \right). \qquad (15)$$


The theorem is proved in Appendix A.6. The theorem bounds the classification accuracy in terms of the smoothness of the RBF interpolation function and the number of samples. The condition in (14) characterizes the compromise between the separation and the regularity of the interpolator, which depends on the Lipschitz constant of the RBF kernels and the coefficient magnitude. As the Lipschitz constant $L_\phi$ and the coefficient magnitude parameter $C$ increase (i.e., $f$ becomes less “regular”), a higher separation $\gamma_Q$ is required to provide a performance guarantee. When the separation margin of the embedding and the interpolator satisfy the condition in (14), the misclassification probability decays exponentially as the number of training samples increases, similarly to the results in Section 2.2.
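As a numerical illustration of Theorem 8, the condition (14) and the right-hand side of (15) can be evaluated directly. The following Python sketch is only an aid to reading the bound; the function and argument names are hypothetical and the parameter values in the usage lines are arbitrary, not taken from the experiments.

```python
import math

def condition_14_holds(d, c, lip, delta, eps, gamma_q):
    """Check sqrt(d) * C * (L_phi * delta + eps) <= gamma_Q / 2, i.e. condition (14)."""
    return math.sqrt(d) * c * (lip * delta + eps) <= gamma_q / 2.0

def linear_classifier_bound(n_m, n_total, q, eta, eps, delta, lip):
    """Right-hand side of (15): lower bound on the probability of correct classification."""
    if n_m * eta <= q:
        raise ValueError("Theorem 8 requires N_m > Q / eta_{m,delta}")
    # Term controlling how many class-m samples fall in the delta-neighborhood.
    sampling_term = math.exp(-2.0 * (n_m * eta - q) ** 2 / n_m)
    # Term controlling the deviation of the interpolator (Lemma 7).
    deviation_term = 2.0 * n_total * math.exp(
        -(q - 1) * eps ** 2 / (2.0 * lip ** 2 * delta ** 2))
    return 1.0 - sampling_term - deviation_term

# Arbitrary illustrative values (assumptions, not values from the paper).
ok = condition_14_holds(d=2, c=1.0, lip=1.0, delta=0.1, eps=0.5, gamma_q=2.0)
bound = linear_classifier_bound(n_m=1000, n_total=2000, q=200, eta=0.3,
                                eps=0.5, delta=0.1, lip=1.0)
```

Both exponential terms shrink as $N_m$ grows with $\eta_{m,\delta}$ and $Q$ fixed, which is the exponential decay stated in the theorem.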

Theorem 8 studies the misclassification probability when the class labels in the low-dimensional domain are estimated with a linear classifier. We also present below a bound on the misclassification probability when the nearest-neighbor classifier is used in the low-dimensional domain.

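Theorem 9 Let $X = \{x_i\}_{i=1}^N \subset H$ be a set of training samples such that each $x_i$ is drawn i.i.d. from one of the probability measures $\{\nu_m\}_{m=1}^M$. Let $Y$ be an embedding of $X$ in $\mathbb{R}^d$ such that

\[ \begin{aligned} \|y_i - y_j\| &\leq D_\delta, && \text{if } \|x_i - x_j\| \leq \delta \text{ and } C_i = C_j, \\ \|y_i - y_j\| &> \gamma, && \text{if } C_i \neq C_j. \end{aligned} \]

For given $\epsilon > 0$ and $\delta > 0$, let $f$ be an RBF interpolator such that

\[ \sqrt{d}\, C (L_\phi \delta + \epsilon) + D_{2\delta} \leq \frac{\gamma}{2}. \tag{16} \]

Consider a test sample $x$ randomly drawn according to the probability measure $\nu_m$ of class $m$. If $X$ contains at least $N_m$ training samples drawn i.i.d. from $\nu_m$ such that

\[ N_m > \frac{Q}{\eta_{m,\delta}}, \]

then the probability of correctly classifying $x$ with nearest-neighbor classification in $\mathbb{R}^d$ is lower bounded as

\[ P\big(\hat{C}(x) = m\big) \geq 1 - \exp\left( -\frac{2 (N_m \eta_{m,\delta} - Q)^2}{N_m} \right) - 2N \exp\left( -\frac{(Q-1)\,\epsilon^2}{2 L_\phi^2 \delta^2} \right). \tag{17} \]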


Theorem 9 is proved in Appendix A.7. While it provides the same convergence rate as Theorem 8, the necessary condition in (16) also includes the parameter $D_{2\delta}$. Hence, if the embedding maps nearby samples from the same class to nearby points, and a compromise is achieved between the separation and the interpolator regularity, the misclassification probability can be upper bounded.


2.4 Optimizing the scale of Gaussian RBF kernels 

In data interpolation with RBFs, it is known that the accuracy of interpolation is quite 
sensitive to the choice of the shape parameter for several kernels including the Gaussian 
kernel (Baxter, 1992). The relation between the shape parameter and the performance of 
interpolation has been an important problem of interest (Piret, 2007). In this section, we 
focus on the Gaussian RBF kernel, which is a popular choice for RBF interpolation due to 
its smoothness and good spatial localization properties. We study the choice of the scale 
parameter of the kernel within the context of classification.

We consider the RBF kernel given by

\[ \phi(r) = e^{-r^2/\sigma^2} \]

where $\sigma$ is the scale parameter of the Gaussian function. We focus on the condition (14) in Theorem 8,

\[ \sqrt{d}\, C (L_\phi \delta + \epsilon) \leq \gamma_Q / 2, \]


(or equivalently the condition (16) if the nearest neighbor classifier is used), which relates the interpolation function properties with the separation. In particular, for a given separation margin, this condition is satisfied more easily when the term on the left hand side of the inequality is smaller. Thus, in the following, we derive an expression for the left hand side of the above inequality by deriving the Lipschitz constant $L_\phi$ and the coefficient bound $C$ in terms of the scale parameter $\sigma$ of the Gaussian kernel. We then study the scale parameter that minimizes $\sqrt{d}\, C (L_\phi \delta + \epsilon)$.

Writing the condition $f(x_i) = y_i$ in a matrix form for each dimension $k = 1, \dots, d$, we have

\[ \Phi c^k = y^k \tag{18} \]

where $\Phi \in \mathbb{R}^{N \times N}$ is a matrix whose $(i,j)$-th entry is given by $\Phi_{ij} = \phi(\|x_i - x_j\|)$, $c^k \in \mathbb{R}^{N \times 1}$ is the coefficient vector whose $i$-th entry is $c_i^k$, and $y^k \in \mathbb{R}^{N \times 1}$ is the data coordinate vector giving the $k$-th dimensions of the embeddings of all samples, i.e., $y^k$ is the $k$-th column of $Y$. Assuming that the embedding is computed with the usual scale constraint $Y^T Y = I$, we have $\|y^k\| = 1$. The norm of the coefficient vector can then be bounded as

\[ \|c^k\| \leq \|\Phi^{-1}\| \, \|y^k\| = \|\Phi^{-1}\|. \tag{19} \]


In the rest of this section, we assume that the data $X$ are sampled from the Euclidean space, i.e., $H = \mathbb{R}^n$. We first use a result by Narcowich et al. (1994) in order to bound the norm $\|\Phi^{-1}\|$ of the inverse matrix. From (Narcowich et al., 1994, Theorem 4.1) we get$^1$

\[ \|\Phi^{-1}\| \leq \beta\, \sigma^{-n} e^{\alpha \sigma^2} \]

where $\alpha > 0$ and $\beta > 0$ are constants depending on the dimension $n$ and the minimum distance between the training points $X$ (separation radius) (Narcowich et al., 1994). As the $\ell_1$-norm of the coefficient vector can be bounded as $\|c^k\|_1 \leq \sqrt{N}\, \|c^k\|$, from (19) one can set the parameter $C$ that upper bounds the coefficient magnitudes as

\[ C = a\, \sigma^{-n} e^{\alpha \sigma^2} \]

where $a = \beta \sqrt{N}$.
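The linear system (18) and the norm bounds used above can be checked on a small example. The sketch below uses NumPy and entirely hypothetical toy data (five points on a line, embedded with $d = 1$ and $\|y^1\| = 1$); the helper names are illustrative and do not come from the paper.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Phi[i, j] = exp(-||x_i - x_j||^2 / sigma^2) for the rows x_i of X."""
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dist / sigma ** 2)

def fit_rbf_coefficients(X, Y, sigma):
    """Solve Phi c^k = y^k for every dimension k; column k of the result is c^k."""
    return np.linalg.solve(gaussian_kernel_matrix(X, sigma), Y)

def coefficient_bound(coeffs):
    """Smallest C such that sum_i |c_i^k| <= C holds for every dimension k."""
    return float(np.max(np.sum(np.abs(coeffs), axis=0)))

# Hypothetical toy data: N = 5 samples in R^1, embedding with Y^T Y = I (d = 1).
X = np.arange(5, dtype=float).reshape(-1, 1)
Y = np.array([[-0.5], [-0.5], [0.0], [0.5], [0.5]])
sigma = 1.0

Phi = gaussian_kernel_matrix(X, sigma)
coeffs = fit_rbf_coefficients(X, Y, sigma)
C = coefficient_bound(coeffs)
inv_norm = np.linalg.norm(np.linalg.inv(Phi), 2)   # spectral norm of Phi^{-1}
```

Here `np.linalg.norm(coeffs[:, 0])` is at most `inv_norm`, as in (19), and `C` is at most $\sqrt{N}$ times that Euclidean norm.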

1. The result stated in (Narcowich et al., 1994, Theorem 4.1) is adapted to our study by taking the measure as $\beta(\rho) = \delta(\rho - \rho_0)$ so that the RBF kernel defined in (Narcowich et al., 1994, (1.1)) corresponds to a Gaussian function as $F(r) = \exp(-\rho_0 r^2)$. The scale of the Gaussian kernel is then given by $\sigma = \rho_0^{-1/2}$.

Next, we derive a Lipschitz constant for the Gaussian kernel $\phi(r)$ in terms of $\sigma$. Setting the second derivative of $\phi$ to zero

\[ \frac{d^2 \phi}{dr^2} = \left( \frac{4 r^2}{\sigma^4} - \frac{2}{\sigma^2} \right) e^{-r^2/\sigma^2} = 0 \]

we get that the maximum value of $|d\phi/dr|$ is attained at $r = \sigma/\sqrt{2}$. Evaluating $|d\phi/dr|$ at this value, we obtain

\[ L_\phi = \sqrt{2}\, e^{-1/2} \sigma^{-1}. \]
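The closed-form Lipschitz constant $L_\phi = \sqrt{2}\, e^{-1/2} \sigma^{-1}$ of the kernel $\phi(r) = e^{-r^2/\sigma^2}$ can be compared against finite-difference slopes. The following standard-library Python sketch is only a sanity check; the grid range and resolution are arbitrary choices.

```python
import math

def gaussian_rbf(r, sigma):
    """Gaussian kernel phi(r) = exp(-r^2 / sigma^2)."""
    return math.exp(-r ** 2 / sigma ** 2)

def lipschitz_constant(sigma):
    """Closed form L_phi = sqrt(2) * exp(-1/2) / sigma."""
    return math.sqrt(2.0) * math.exp(-0.5) / sigma

def max_slope_on_grid(sigma, r_max=10.0, steps=100000):
    """Largest finite-difference slope of phi on a uniform grid over [0, r_max]."""
    h = r_max / steps
    best = 0.0
    for i in range(steps):
        r = i * h
        slope = abs(gaussian_rbf(r + h, sigma) - gaussian_rbf(r, sigma)) / h
        best = max(best, slope)
    return best

sigma = 2.0
closed_form = lipschitz_constant(sigma)
numerical = max_slope_on_grid(sigma)
```

The numerical maximum approaches the closed form from below, and halving `sigma` doubles the constant, which is the loss of regularity at small scales.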


Now rewriting the condition (14) of the theorem, we have

\[ \sqrt{d}\, C (L_\phi \delta + \epsilon) = a_1 \sigma^{-n-1} e^{\alpha \sigma^2} + a_2 \sigma^{-n} e^{\alpha \sigma^2} \leq \gamma_Q / 2 \]

where $a_1 = \sqrt{2d}\, a\, e^{-1/2} \delta$ and $a_2 = \sqrt{d}\, a\, \epsilon$. We thus determine the Gaussian scale parameter $\sigma$ that minimizes

\[ F(\sigma) = a_1 \sigma^{-n-1} e^{\alpha \sigma^2} + a_2 \sigma^{-n} e^{\alpha \sigma^2}. \]

First, notice that as $\sigma \to 0$ and $\sigma \to \infty$, the function $F(\sigma) \to \infty$. Therefore, it has at least one minimum. Setting

\[ \frac{dF}{d\sigma} = e^{\alpha \sigma^2} \sigma^{-n-2} \left( 2 \alpha a_2 \sigma^3 + 2 \alpha a_1 \sigma^2 - a_2 n \sigma - a_1 (n+1) \right) = 0 \]

we need to solve

\[ 2 \alpha a_2 \sigma^3 + 2 \alpha a_1 \sigma^2 - a_2 n \sigma - a_1 (n+1) = 0. \]

The leading and the second-degree coefficients are positive, while the first-degree and the constant coefficients are negative in the above cubic polynomial. Then, the sum of the roots is negative and the product of the roots is positive. Therefore, there is one and only one positive root $\sigma_{opt}$, which is the unique minimizer of $F(\sigma)$.
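Since the cubic is negative at $\sigma = 0$ and has a single positive root, $\sigma_{opt}$ can be located by bisection. The sketch below is one possible way to do this in plain Python; the constants passed in the usage lines are arbitrary placeholders, because $\alpha$ and $\beta$ (hence $a_1$, $a_2$) are not known explicitly in practice.

```python
import math

def cubic(sigma, a1, a2, alpha, n):
    """Cubic factor of dF/dsigma: 2*alpha*a2*s^3 + 2*alpha*a1*s^2 - a2*n*s - a1*(n+1)."""
    return (2.0 * alpha * a2 * sigma ** 3 + 2.0 * alpha * a1 * sigma ** 2
            - a2 * n * sigma - a1 * (n + 1))

def objective(sigma, a1, a2, alpha, n):
    """F(sigma) = (a1 * sigma^(-n-1) + a2 * sigma^(-n)) * exp(alpha * sigma^2)."""
    return (a1 * sigma ** (-n - 1) + a2 * sigma ** (-n)) * math.exp(alpha * sigma ** 2)

def optimal_scale(a1, a2, alpha, n, tol=1e-12):
    """Unique positive root of the cubic, found by bisection."""
    lo, hi = 0.0, 1.0
    while cubic(hi, a1, a2, alpha, n) <= 0.0:   # enlarge the bracket until the sign changes
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if cubic(mid, a1, a2, alpha, n) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Placeholder constants (assumptions for illustration only).
a1, a2, alpha, n = 1.0, 1.0, 1.0, 2
sigma_opt = optimal_scale(a1, a2, alpha, n)
```

Evaluating `objective` slightly below and above `sigma_opt` gives larger values, in line with the uniqueness of the minimizer.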

The existence of an optimal scale parameter $0 < \sigma_{opt} < \infty$ for the RBF kernel can be intuitively explained as follows. When $\sigma$ takes too small values, the support of the RBF function concentrated around the training points does not sufficiently cover the whole class supports $\mathcal{M}_m$. This manifests itself in (14) with the increase in the term $L_\phi$, which indicates that the interpolation function is not sufficiently regular. This weakens the guarantee that a test sample will be interpolated sufficiently close to its neighboring training samples from the same class and mapped to the correct side of the hyperplane in the linear classifier. On the other hand, when $\sigma$ increases too much, the stability of the linear system (18) is impaired and the coefficients $c$ increase too much. This results in an overfitting of the interpolator and, therefore, decreases the classification performance. Hence, the analysis in this section provides a theoretical justification of the common knowledge that $\sigma$ should be set to a sufficiently large value while avoiding overfitting.


2.5 Discussion of the results in relation with previous results 

In Theorems 8 and 9, we have presented results that characterize the performance of classification with RBF interpolation functions. In particular, we have considered a setting where an RBF interpolator is fitted to each dimension of a low-dimensional embedding where different classes are separable. Our study has several links with RBF networks or least-squares regression algorithms. In this section, we interpret our findings in relation with previously established results.

Several previous works study the performance of learning by considering a probability measure $\rho$ defined on $X \times Y$, where $X$ and $Y$ are two sets. The “label” set $Y$ is often taken as an interval $[-L, L]$. Given a set of data pairs $\{(x_j, y_j)\}_{j=1}^N$ sampled from the distribution $\rho$, the RBF network estimates a function $\hat{f}$ of the form

\[ \hat{f}(x) = \sum_{j=1}^{R} c_j\, \phi\left( \frac{\|x - a_j\|}{\sigma_j} \right). \tag{20} \]

The number of RBF terms $R$ may be different from the number of samples $N$ in general. The function $\hat{f}$ minimizes the empirical error

\[ \hat{f} = \arg\min_{f} \sum_{j=1}^{N} (f(x_j) - y_j)^2. \]

The function $\hat{f}$ estimated from a finite collection of data samples is often compared to the regression function (Cucker and Smale, 2002)

\[ f_o(x) = \int_Y y \, d\rho(y|x) \]

where $d\rho(y|x)$ is the conditional probability measure on $Y$. The regression function $f_o$ minimizes the expected risk as

\[ f_o = \arg\min_{f} \int_{X \times Y} (f(x) - y)^2 \, d\rho. \]


As the probability measure $\rho$ is not known in practice, the estimate $\hat{f}$ of $f_o$ is obtained from data samples. Several previous works have characterized the performance of learning by studying the approximation error (Niyogi and Girosi, 1996), (Lin et al., 2014)

\[ E[(f_o - \hat{f})^2] = \int_X (f_o(x) - \hat{f}(x))^2 \, d\rho_X(x) \tag{21} \]

where $\rho_X$ is the marginal probability measure on $X$. This definition of the approximation error can be adapted to our setting as follows. In our problem the distribution of each class is assumed to have a bounded support, which is a special case of modeling the data with an overall probability distribution $\rho$. If the supports $\mathcal{M}_m$ are assumed to be nonintersecting, the regression function $f_o$ is given by

\[ f_o(x) = \sum_{m=1}^{M} m \, I_m(x) \]

which corresponds to the class labels $m = 1, \dots, M$, where $I_m$ is the indicator function of the support $\mathcal{M}_m$. It is then easy to show that the approximation error $E[(f_o - \hat{f})^2]$ can be bounded as a constant times the probability of misclassification $P(\hat{C}(x) \neq m)$. Hence, we can compare our misclassification probability bounds in Section 2.3 with the approximation error in other works.

The study in (Niyogi and Girosi, 1996) assumes that the regression function is an element of the Bessel potential space of a sufficiently high order and that the sum of the coefficients $|c_j|$ is bounded. It is then shown that for data sampled from $\mathbb{R}^n$, with probability greater than $1 - \delta$ the approximation error in (21) can be bounded as

\[ E[(f_o - \hat{f})^2] \leq O\left( \frac{1}{R} \right) + O\left( \sqrt{\frac{R n \log(RN) - \log(\delta)}{N}} \right). \tag{22} \]

The analysis by Lin et al. (2014) considers families of RBF kernels that include the Gaussian function. Supposing that the regression function $f_o$ belongs to a Sobolev class of regularity $r$, and that the number of RBF terms is given by $R = N^{\frac{n}{n+2r}}$ in terms of the number of samples $N$, the approximation error is bounded as

\[ E[(f_o - \hat{f})^2] \leq O\left( N^{-\frac{2r}{n+2r}} \log^2(N) \right). \tag{23} \]


Next, we overview the study by Hernandez-Aguirre et al. (2002), which studies the performance of RBFs in a Probably Approximately Correct (PAC)-learning framework. For $X \subset \mathbb{R}^n$, a family $\mathcal{F}$ of measurable functions from $X$ to $[0,1]$ is considered and the problem of approximating a target function $f_o$ known only through examples with a function $\hat{f} \in \mathcal{F}$ is studied. The authors use a previous result from (Vidyasagar, 1997) that relates the accuracy of empirical risk minimization to the covering number of $\mathcal{F}$ and the number of samples. Combining this result with the bounds on covering number estimates of Lipschitz continuous functions (Kolmogorov and Tihomirov, 1961), the following result is obtained for PAC function learning with RBF neural networks with Gaussian kernel. Let the coefficients be bounded as $|c_j| \leq A$, a common scale parameter be chosen as $\sigma_j = \sigma$, and $E[|f_o - \hat{f}|]$ be computed under a uniform probability measure $\rho$. Then if the number of samples satisfies


8 , f V2RnA\ 

? og (; ' 


; _1 / 2 crC / 


(24) 


an approximation of the target function is obtained with accuracy parameter $\epsilon$ and confidence parameter $\zeta$:

\[ P\big( E[|f_o - \hat{f}|] > \epsilon \big) \leq \zeta. \tag{25} \]

In the above expression, the expectation is over the test samples, whereas the probability is over the training samples; i.e., over all possible distributions of training samples, the probability of having the average approximation error larger than $\epsilon$ is bounded. Note that our results in Theorems 8 and 9, when translated into the above PAC-learning framework, correspond to a confidence parameter of $\zeta = 0$. This is because the misclassification probability bound of a test sample is valid for any choice of the training samples, provided that the condition (14) (or the condition (16)) holds. Thus, in our result the probability running over the training samples in (25) has no counterpart. When we take $\zeta = 0$, the above result does not provide a useful bound since $N \to \infty$ as $\zeta \to 0$. By contrast, our result is valid only if the conditions (14), (16) on the interpolation function hold. It is easy to show that, assuming nonintersecting class supports $\mathcal{M}_m$, the expression $E[|f_o - \hat{f}|]$ is given by a constant times the probability of misclassification. The accuracy parameter $\epsilon$ can then be seen as the counterpart of the misclassification probability upper bound given on the right hand sides of (15) and (17) (the expression subtracted from 1). At fixed $N$, the dependence of the accuracy on the kernel scale parameter is monotonic in the bound (24); $\epsilon$ decreases as $\sigma$ increases. Therefore, this bound does not guide the selection of the scale parameter of the RBF kernel, while the discussion in Section 2.4 (confirmed by the experimental results in Section 4.2) suggests the existence of an optimal scale.


Finally, we mention some results on the learning performance of regularized least squares regression algorithms. In (Caponnetto and De Vito, 2007) optimal rates are derived for the regularized least squares method in a Reproducing Kernel Hilbert Space (RKHS) in the minimax sense. It is shown that, under some hypotheses concerning the data probability measure and the complexity of the family of learnt functions, the maximum error (yielded by the worst distribution) obtained with the regularized least squares method converges at a rate of $O(1/N)$. Next, the work in (Steinwart et al., 2009) shows that, in regularized least squares regression over an RKHS, if the eigenvalues of the kernel integral operator decay sufficiently fast, and if the $\ell_\infty$-norms of regression functions can be bounded, the error of the classifier converges at a rate of up to $O(1/N)$ with high probability. Steinwart et al. also examine the learning performance in relation with the exponent of the function norm in the regularization term and show that the learning rate is not affected by the choice of the exponent of the function norm.


We now overview the three bounds given in (22), (23), and (24) in terms of the dependence of the error on the number of samples. The results in (22) and (23) provide a useful bound only in the case where the number of samples $N$ is larger than the number of RBF terms $R$, contrary to our study where we treat the case $R = N$. If it is assumed that $N$ is sufficiently larger than $R$, the result in (22) predicts a rate of decay of only $O(\sqrt{\log(N)/N})$ in the misclassification probability. The bound in (23) improves with the Sobolev regularity of the regression function; however, the dependence of the error on the number of samples is of a similar nature to the one in (22). Considering $\epsilon$ as a misclassification error parameter in the bound in (24), the error decreases at a rate of $O(N^{-1/2})$ as the number of samples increases. The analyses in (Caponnetto and De Vito, 2007) and (Steinwart et al., 2009) also provide similar rates of convergence of $O(N^{-1})$. Meanwhile, our results in Theorems 8 and 9 predict an exponential decay in the misclassification probability as the number of samples $N$ increases (under the reasonable assumption that $N_m = O(N)$ for each class $m$). The reason why we arrive at a more optimistic bound is the specialization of the analysis to the considered particular setting, where the support of each class is assumed to be restricted to a totally bounded region in the ambient space, as well as the assumed relations between the separation margin of the embedding and the regularity of the interpolation function.

Another difference between these previous results and ours is the dependence on the dimension. The results in (22), (23), and (24) predict an increase in the error at the respective rates of $O(\sqrt{n})$, $O(e^{-1/n})$, and $O(\sqrt{\log n})$ with the ambient space dimension $n$. While these results assume that the data $X \subset \mathbb{R}^n$ is in a Euclidean space of dimension $n$, our study assumes the data $X$ to be in a generic Hilbert space $H$. The results in Theorems 5, 8 involve the dimension $d$ of the low-dimensional space of embedding and do not explicitly depend on the dimension of the ambient Hilbert space $H$ (which could be infinite-dimensional). However, especially in the context of manifold learning, it is interesting to analyze the dependence of our bound on the intrinsic dimension of the class supports $\mathcal{M}_m$.

In order to put the expressions (15), (17) in a more convenient form, let us reduce one parameter by setting $Q = N_m \eta_{m,\delta}/2$. Then the misclassification probability is of

\[ O\left( \exp\left( -N_m \eta_{m,\delta}^2 \right) + N \exp\left( -\frac{N_m \eta_{m,\delta}\, \epsilon^2}{L_\phi^2 \delta^2} \right) \right). \]

We can relate the dependence of this expression on the intrinsic dimension as follows. Since the supports $\mathcal{M}_m$ are assumed to be totally bounded, one can define a parameter $\Theta$ that represents the “diameter” of $\mathcal{M}_m$, i.e., the largest distance between any two points on $\mathcal{M}_m$. Then the measure $\eta_{m,\delta}$ of the minimum ball of radius $\delta$ in $\mathcal{M}_m$ is of $O((\delta/\Theta)^D)$, where $D$ is the intrinsic dimension of $\mathcal{M}_m$. Replacing this in the above expression gives the probability of misclassification as

\[ O\left( \exp\left( -\frac{N_m \delta^{2D}}{\Theta^{2D}} \right) + N \exp\left( -\frac{N_m\, \delta^{D-2} \epsilon^2}{L_\phi^2\, \Theta^{D}} \right) \right). \]

This shows that in order to retain the correct classification guarantee, as the intrinsic dimension $D$ grows, the number of samples $N_m$ should increase at a geometric rate with $D$. In supervised manifold learning problems, data sets usually have a low intrinsic dimension; therefore, this geometric rate of increase can often be tolerated. Meanwhile, the dimension of the ambient space is typically high, so that performance bounds independent of the ambient space dimension are of particular interest.
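The geometric growth of the required sample size with the intrinsic dimension can be made concrete with the requirement $N_m > Q/\eta_{m,\delta}$ of Theorems 8 and 9. The sketch below assumes, purely for illustration, that $\eta_{m,\delta} = (\delta/\Theta)^D$ exactly (the text only states the order of magnitude); the function name is hypothetical.

```python
import math

def min_class_samples(q, delta, theta, dim):
    """Smallest integer N_m with N_m > Q / eta, assuming eta = (delta / theta) ** dim."""
    eta = (delta / theta) ** dim
    return math.floor(q / eta) + 1

# Q = 10 neighbors, neighborhood radius equal to half the support diameter.
needed = [min_class_samples(q=10, delta=0.5, theta=1.0, dim=D) for D in (1, 2, 3)]
```

With these values the requirement roughly doubles with each additional intrinsic dimension, i.e., it scales like $(\Theta/\delta)^D$.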


3. Separability of supervised nonlinear embeddings 

In the results in Section 2, we have presented generalization bounds for classifiers based on linearly separable embeddings. One may wonder if the separability assumption is easy to satisfy when computing structure-preserving nonlinear embeddings of data. In this section, we try to answer this question by focusing on a particular family of supervised dimensionality reduction algorithms, i.e., supervised Laplacian eigenmaps embeddings, and analyze the conditions of separability. We first discuss the supervised Laplacian eigenmaps embeddings in Section 3.1 and then present results in Section 3.2 about the linear separability of these embeddings.


3.1 Supervised Laplacian eigenmaps embeddings 

Let $X = \{x_i\}_{i=1}^N \subset H$ be a set of training samples, where each $x_i$ belongs to one of $M$ classes. Most structure-preserving supervised manifold learning algorithms rely on a graph representation of data. Consider a weighted data graph $G$ each vertex of which represents a point $x_i$. We write $x_i \sim x_j$, or simply $i \sim j$, if the vertices $x_i$, $x_j$ are neighbors and denote the weight of the edge by $w_{ij} > 0$. The weights $w_{ij}$ are usually determined as a positive and monotonically decreasing function of the distance between $x_i$ and $x_j$ in $H$, where the Gaussian function is a common choice. Nevertheless, we maintain a generic formulation here without making any assumption on the neighborhood or weight selection strategies.

Now let $G_w$ and $G_b$ represent two subgraphs of $G$, which contain the edges of $G$ that are respectively within the same class and between different classes. Hence, $G_w$ contains an edge $i \sim_w j$ between samples $x_i$ and $x_j$, if $i \sim j$ and $C_i = C_j$. Similarly, $G_b$ contains an edge $i \sim_b j$ if $i \sim j$ and $C_i \neq C_j$. We assume that all vertices of $G$ are contained in both $G_w$ and $G_b$, and that $G_w$ has exactly $M$ connected components such that the training samples in each class form a connected component.$^2$ We also assume that $G_w$ and $G_b$ do not contain any isolated vertices; i.e., each data sample $x_i$ has at least one neighbor in both graphs.

The $N \times N$ weight matrices $W_w$ and $W_b$ of $G_w$ and $G_b$ have entries as follows.

\[ W_w(i,j) = \begin{cases} w_{ij} & \text{if } i \sim j \text{ and } C_i = C_j \\ 0 & \text{otherwise} \end{cases} \qquad W_b(i,j) = \begin{cases} w_{ij} & \text{if } i \sim j \text{ and } C_i \neq C_j \\ 0 & \text{otherwise} \end{cases} \]

Let $d_w(i)$ and $d_b(i)$ denote the degrees of $x_i$ in $G_w$ and $G_b$

\[ d_w(i) = \sum_{j \sim_w i} w_{ij}, \qquad d_b(i) = \sum_{j \sim_b i} w_{ij} \]

and $D_w$, $D_b$ denote the $N \times N$ diagonal degree matrices given by $D_w(i,i) = d_w(i)$, $D_b(i,i) = d_b(i)$. The normalized graph Laplacian matrices $L_w$ and $L_b$ of $G_w$ and $G_b$ are then defined as

\[ L_w := D_w^{-1/2} (D_w - W_w) D_w^{-1/2}, \qquad L_b := D_b^{-1/2} (D_b - W_b) D_b^{-1/2}. \]

Supervised extensions of the Laplacian eigenmaps and LPP algorithms seek a $d$-dimensional embedding of the data set $X$, such that each $x_i$ is represented by a vector $y_i \in \mathbb{R}^{d \times 1}$. Denoting the new data matrix as $Y = [y_1 \ y_2 \ \cdots \ y_N]^T \in \mathbb{R}^{N \times d}$, the coordinates of data samples are computed by solving the problem

“Minimize $\mathrm{tr}(Y^T L_w Y)$ while maximizing $\mathrm{tr}(Y^T L_b Y)$.” (26)


The reason behind this formulation can be explained as follows. For a graph Laplacian matrix $L = D^{-1/2} (D - W) D^{-1/2}$, where $D$ and $W$ are respectively the degree and the weight matrices, defining the coordinates $Z = D^{-1/2} Y$ normalized with the vertex degrees, we have

\[ \mathrm{tr}(Y^T L Y) = \mathrm{tr}(Z^T (D - W) Z) = \sum_{i \sim j} \|z_i - z_j\|^2 w_{ij} \tag{27} \]




where $z_i$ is the $i$-th row of $Z$ giving the normalized coordinates of the embedding of the data sample $x_i$. Hence, the problem in (26) seeks a representation $Y$ that maps nearby samples in the same class to nearby points, while mapping nearby samples from different classes to distant points. In fact, when the samples $x_i$ are assumed to come from a manifold $\mathcal{M}$, the term $y^T L y$ is the discrete equivalent of

\[ \int_{\mathcal{M}} \|\nabla f(x)\|^2 \, dx \]

where $f : \mathcal{M} \to \mathbb{R}$ is a continuous function on the manifold that extends the one-dimensional coordinates $y$ to the whole manifold. Hence, the term $\mathrm{tr}(Y^T L Y)$ captures the rate of change of the learnt coordinate vectors $Y$ over the underlying manifold. Then, in a setting where the samples of different classes come from $M$ different manifolds $\{\mathcal{M}_m\}_{m=1}^M$, the formulation in (26) looks for a function that has a slow variation on each manifold $\mathcal{M}_m$, while having a fast variation “between” different manifolds.

2. The straightforward application of common graph construction strategies, like connecting each training sample to its $K$-nearest neighbors or to its neighbors within a given distance, may result in several disconnected components in a single class in the graph if there is much diversity in that class. However, this difficulty can be easily overcome by introducing extra edges to bridge between graph components that are originally disconnected.

The supervised learning problem in (26) has so far been studied by several authors with slight variations in their problem formulations. Raducanu and Dornaika (2012) minimize a weighted difference of the within-class and between-class similarity terms in (26) in order to learn a nonlinear embedding. Meanwhile, linear dimensionality reduction methods pose the manifold learning problem as the learning of a linear projection matrix $P \in \mathbb{R}^{d \times n}$; therefore, they solve the problem in (26) under the constraint $y_i = P x_i$, where $x_i \in \mathbb{R}^{n \times 1}$ and $d < n$. Hua et al. (2012) formulate the problem as the minimization of the difference of the within-class and the between-class similarity terms in (26) as well. Thus, their algorithm can be seen as the linear version of the method by Raducanu and Dornaika (2012). Sugiyama (2007) proposes an adaptation of the Fisher discriminant analysis algorithm to preserve the local structures of data. Data sample pairs are weighted with respect to their affinities in the construction of the within-class and the between-class scatter matrices in Fisher discriminant analysis. Then the trace of the ratio of the between-class and the within-class scatter matrices is maximized to learn a linear embedding. Meanwhile, the within-class and the between-class local scatter matrices are closely related to the two terms in (26) as shown by Yang et al. (2011). The terms $Y^T L_w Y$ and $Y^T L_b Y$, when evaluated under the constraint $y_i = P x_i$, become equal to the locally weighted within-class and between-class scatter matrices of the projected data. Cui and Fan (2012) and Wang and Chen (2009) propose to maximize the ratio of the between-class and the within-class local scatters in the learning. Yang et al. (2011) optimize the same objective function, while they construct the between-class graph only on the centers of mass of the classes. Zhang et al. (2012) similarly optimize a Fisher metric to maximize the ratio of the between- and within-class scatters; however, the total scatter is also taken into account in the objective function in order to preserve the overall manifold structure.

All of the above methods use similar formulations of the supervised manifold learning problem and give comparable results. In our study, we base our analysis on the following formal problem definition

\[ \min_{Y} \ \mathrm{tr}(Y^T L_w Y) - \mu \, \mathrm{tr}(Y^T L_b Y) \quad \text{subject to } Y^T Y = I \tag{28} \]

which minimizes the difference of the within-class and the between-class similarity terms as in (Raducanu and Dornaika, 2012) and (Hua et al., 2012). Here $I$ is the $d \times d$ identity matrix and $\mu > 0$ is a parameter adjusting the weights of the two terms. The condition $Y^T Y = I$ is a commonly used constraint to remove the scale ambiguity of the coordinates. The solution of the problem (28) is given by the first $d$ eigenvectors of the matrix

\[ L_w - \mu L_b \]

corresponding to its smallest eigenvalues.
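A minimal NumPy sketch of this eigenvector solution is given below. It is an illustration under stated assumptions, not a reference implementation: the function names, the toy graph with six vertices, and the value $\mu = 0.1$ are all hypothetical, and the weight matrix is assumed symmetric with no isolated vertices in either subgraph.

```python
import numpy as np

def normalized_laplacian(W):
    """L = D^{-1/2} (D - W) D^{-1/2} for a symmetric weight matrix W (no zero degrees)."""
    deg = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(deg)
    return (np.diag(deg) - W) * inv_sqrt[:, None] * inv_sqrt[None, :]

def supervised_laplacian_embedding(W, labels, mu, dim):
    """Eigenvectors of L_w - mu * L_b for the `dim` smallest eigenvalues, as in (28)."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    W_w = np.where(same, W, 0.0)            # within-class edges
    W_b = np.where(same, 0.0, W)            # between-class edges
    M = normalized_laplacian(W_w) - mu * normalized_laplacian(W_b)
    eigvals, eigvecs = np.linalg.eigh(M)    # eigenvalues returned in ascending order
    return eigvecs[:, :dim]                 # columns are orthonormal, so Y^T Y = I

# Hypothetical toy graph: two classes of three samples each.
edges = [(0, 1, 1.0), (1, 2, 1.0), (3, 4, 1.0), (4, 5, 1.0),   # within-class
         (2, 3, 0.5), (1, 4, 0.5), (0, 5, 0.5)]                # between-class
W = np.zeros((6, 6))
for i, j, w in edges:
    W[i, j] = W[j, i] = w
labels = [1, 1, 1, 2, 2, 2]

y = supervised_laplacian_embedding(W, labels, mu=0.1, dim=1)[:, 0]
```

On this toy graph the one-dimensional coordinates of the two classes come out with opposite signs (the overall sign of an eigenvector is arbitrary).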

Our purpose in this section is then to theoretically study the linear separability of the learnt coordinates of training data, with respect to the definition of linear separability given in (1). In the following, we determine some conditions on the graph properties and the weight parameter $\mu$ that ensure the linear separability. We derive lower bounds on the margin $\gamma$ and study its dependence on the model parameters. Let us give beforehand the following definitions about the graphs $G_w$ and $G_b$.


Definition 10 The volume of the subgraph of $G_w$ that corresponds to the connected component containing samples from class $k$ is

\[ V_k := \sum_{i:\, C_i = k} d_w(i). \]

We define the maximal within-class volume as

\[ V_{max} := \max_{k = 1, \dots, M} V_k. \]

The volume of the component of $G_b$ containing the edges between the samples of classes $k$ and $l$ is$^3$

\[ V^b_{kl} := \sum_{\substack{i \sim_b j \\ C_i = k,\ C_j = l}} 2 w_{ij}. \]

We then define the maximal pairwise between-class volume as

\[ V^b_{max} := \max_{k \neq l} V^b_{kl}. \]

In a connected graph, the distance between two vertices $x_i$ and $x_j$ is the number of edges in a shortest path joining $x_i$ and $x_j$. The diameter of the graph is then given by the maximum distance between any two vertices in the graph (Chung, 1996). We define the diameter of the connected component of $G_w$ corresponding to class $k$ as follows.


Definition 11 For any two vertices $x_i$ and $x_j$ such that $C_i = C_j = k$, consider a within-class shortest path joining $x_i$ and $x_j$, which contains samples only from class $k$. Then the diameter $D_k$ of the connected component of $G_w$ corresponding to class $k$ is the maximum number of edges in the within-class shortest path joining any two vertices $x_i$ and $x_j$ from class $k$.

Definition 12 The minimum edge weight within class $k$ is defined as

\[ w_{min,k} := \min_{\substack{i \sim_w j \\ C_i = C_j = k}} w_{ij}. \]

3. In order to keep the analogy with the definition of $V_k$, a factor of 2 is introduced in this expression as each edge is counted only once in the sum.

3.2 Separability bounds for two classes


We now present a lower bound for the linear separability of the embedding obtained by solving (28) in a setting with two classes $C_i \in \{1, 2\}$. We first show that an embedding of dimension $d = 1$ is sufficient to achieve linear separability for the case of two classes. We then derive a lower bound on the separation in terms of the graph parameters and the algorithm parameter $\mu$.

Consider a one-dimensional embedding $Y = y = [y_1 \ y_2 \ \cdots \ y_N]^T \in \mathbb{R}^{N \times 1}$, where $y_i \in \mathbb{R}$ is the coordinate of the data sample $x_i$ in the one-dimensional space. The coordinate vector $y$ is given by the eigenvector of $L_w - \mu L_b$ corresponding to its smallest eigenvalue. We begin by presenting the following result, which states that the samples from the two classes are always mapped to different halves (nonnegative or nonpositive) of the real line.

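Lemma 13 The learnt embedding $y$ of dimension $d = 1$ satisfies

\[ \begin{aligned} y_i &\leq 0 && \text{if } C_i = 1 \ (\text{or respectively } C_i = 2) \\ y_i &\geq 0 && \text{if } C_i = 2 \ (\text{or respectively } C_i = 1) \end{aligned} \]

for any $\mu > 0$ and for any choice of the graph parameters.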


Lemma 13 is proved in Appendix B.1. The lemma states that in one-dimensional embeddings of two classes, samples from different classes always have coordinates with different signs. Therefore, the hyperplane given by $\omega = 1$, $b = 0$ separates the data as $\omega^T y_i \leq 0$ for $C_i = 1$ and $\omega^T y_i \geq 0$ for $C_i = 2$ (since the embedding is one dimensional, the vector $\omega$ is a scalar in this case). However, this does not guarantee that the data is separable with a positive margin $\gamma > 0$. In the following result, we show that a positive margin exists and give a lower bound on it. In the rest of this section, we assume without loss of generality that classes 1 and 2 are respectively mapped to the negative and positive halves of the real axis.


Theorem 14 Defining the normalized data coordinates $z = D_w^{-1/2} y$, let

\[ z_{1,max} := \max_{i:\, C_i = 1} z_i, \qquad z_{2,min} := \min_{i:\, C_i = 2} z_i \]

denote the maximum and minimum coordinates that classes 1 and 2 are respectively mapped to with a one-dimensional embedding learnt with supervised Laplacian eigenmaps. We also define the parameters

\[ w_{min} = \min_{k \in \{1,2\}} \frac{w_{min,k}}{D_k}, \qquad \beta_i = \frac{d_w(i)}{d_b(i)}, \qquad \beta_{max} = \max_i \beta_i \]

where $D_k$ is the diameter of the graph corresponding to class $k$ as defined in Definition 11. Then, if the weight parameter is chosen such that $0 < \mu < w_{min} / (\beta_{max} V^b_{max})$, any supervised Laplacian embedding of dimension $d \geq 1$ is linearly separable with a positive margin lower bounded as below:

\[ z_{2,min} - z_{1,max} \geq \frac{1}{\sqrt{V_{max}}} \left( 1 - \sqrt{\frac{\mu\, \beta_{max} V^b_{max}}{w_{min}}} \right). \tag{29} \]

The proof of Theorem 14 is given in Appendix B.2. The proof is based on a variational characterization of the eigenvector of $L_w - \mu L_b$ corresponding to its smallest eigenvalue, whose elements are then bounded in terms of the parameters of the graph such as the diameters and volumes of its connected components.


Theorem 14 states that an embedding learnt with the supervised Laplacian eigenmaps
method makes two classes linearly separable if the weight parameter $\mu$ is chosen
sufficiently small. In particular, the theorem shows that, for any
$0 < \delta < V_{\max}^{-1/2}$, a choice of the weight parameter $\mu$ satisfying

$$0 < \mu < \frac{w_{\min}}{\beta_{\max} V_{\max}} \left( 1 - \sqrt{V_{\max}}\, \delta \right)^2$$

guarantees a separation of $z_{2,\min} - z_{1,\max} \geq \delta$ between classes 1 and 2 at
$d = 1$. Here, we use the symbol $\delta$ to denote the separation in the normalized
coordinates $z$. In practice, either one of the normalized eigenvectors $z$ or the original
eigenvectors $y$ can be used for embedding the data. If the original eigenvectors $y$ are
used, due to the relation $y = D_w^{1/2} z$, we can lower bound the separation as
$y_{2,\min} - y_{1,\max} \geq \sqrt{d_{w,\min}}\, (z_{2,\min} - z_{1,\max})$, where
$d_{w,\min} = \min_i d_w(i)$. Thus, for any embedding of dimension $d \geq 1$, there exists a
hyperplane that results in a linear separation with a margin $\gamma$ of at least

$$\gamma \geq \sqrt{\frac{d_{w,\min}}{V_{\max}}} \left( 1 - \sqrt{\frac{\mu\, \beta_{\max} V_{\max}}{w_{\min}}} \right).$$


Next, we comment on the dependence of the separation on $\mu$. The inequality in (29)
shows that the lower bound on the separation $z_{2,\min} - z_{1,\max}$ has a variation of
$O(1 - \sqrt{\mu})$ with the weight parameter $\mu$. The fact that the separation decreases
as $\mu$ increases seems counterintuitive at first, since this parameter weights the
between-class dissimilarity in the objective function. This can be explained as follows.
When $\mu$ is high, the algorithm tries to increase the distance between neighboring samples
from different classes as much as possible by moving them away from the origin (remember
that different classes are mapped to the positive and the negative sides of the real line).
However, since the normalized coordinate vector $z$ has to respect the equality
$z^T D_w z = 1$, the total squared norm of the coordinates cannot be arbitrarily large. Due
to this constraint, setting $\mu$ to a high value causes the algorithm to map non-neighboring
samples from different classes to nearby coordinates close to the origin. This occurs since
the increase in $\mu$ reduces the impact of the first term $y^T L_w y$ in the overall objective
and results in an embedding with a weaker link between the samples of the same class. This
causes a polarization of the data and eventually reduces the separation. Hence, the
parameter $\mu$ should be chosen carefully and should not take too large values.
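As an illustration of this dependence, the bound in (29) and the admissible range of the weight parameter can be evaluated numerically. The Python sketch below is only a hedged example: the function names and the numerical values chosen for $w_{\min}$, $\beta_{\max}$ and $V_{\max}$ are assumptions made for illustration and are not taken from the experiments.

```python
import math


def separation_lower_bound(mu, w_min, beta_max, v_max):
    """Right-hand side of (29), valid for 0 < mu < w_min / (beta_max * v_max)."""
    mu_limit = w_min / (beta_max * v_max)
    if not 0.0 < mu < mu_limit:
        raise ValueError("Theorem 14 requires 0 < mu < w_min / (beta_max * v_max)")
    return (1.0 - math.sqrt(mu * beta_max * v_max / w_min)) / math.sqrt(v_max)


def mu_limit_for_separation(delta, w_min, beta_max, v_max):
    """Upper limit on mu that guarantees a separation of at least delta."""
    if not 0.0 < delta < 1.0 / math.sqrt(v_max):
        raise ValueError("delta must lie in (0, v_max ** -0.5)")
    return w_min / (beta_max * v_max) * (1.0 - math.sqrt(v_max) * delta) ** 2


# Hypothetical graph parameters, chosen only for illustration.
W_MIN, BETA_MAX, V_MAX = 0.05, 2.0, 4.0

# The guaranteed separation shrinks as mu grows towards its admissible limit.
bounds = [separation_lower_bound(mu, W_MIN, BETA_MAX, V_MAX)
          for mu in (0.001, 0.002, 0.004)]

# Largest admissible mu for a target separation delta = 0.1.
mu_star = mu_limit_for_separation(0.1, W_MIN, BETA_MAX, V_MAX)
```

With these assumed values the admissible range is $0 < \mu < 0.00625$, and the guaranteed separation decreases monotonically over the three tested values of $\mu$.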

Theorem 14 characterizes the separation at $d = 1$ in terms of the distance between the
supports of the two classes. Meanwhile, it is also interesting to determine the individual
distances of the supports of the two classes to the origin. In the following corollary, we
present a lower bound on the distance between the coordinates of any sample and the
origin.

Corollary 15 The distance between the supports of the first and the second classes and the
origin in a one-dimensional embedding is lower bounded in terms of the separation between
the two classes as

$$\min\{ |z_{1,\max}|, |z_{2,\min}| \} \geq \frac{1}{2}\, \frac{\beta_{\min}}{\beta_{\max}} \left( z_{2,\min} - z_{1,\max} \right)$$

where

$$\beta_{\min} = \min_i \beta_i, \qquad \beta_{\max} = \max_i \beta_i.$$

Corollary 15 is proved in Appendix B.3. The proof is based on a Lagrangian formulation
of the embedding as a constrained optimization problem, which then allows us to establish
a link between the separation and the individual distances of class supports to the origin.
The corollary states a lower bound on the portion of the overall separation lying in the
negative or the positive sides of the real line. In particular, if the vertex degrees are equal
for all samples in $G_w$ and $G_b$ (which is the case, for instance, if all vertices have the same
number of neighbors and a constant weight of $w_{ij} = 1$ is assigned to the edges), since
$\beta_{\min} = \beta_{\max}$, the portions of the overall separation in the positive and negative
sides of the real line will be equal. Although the statement of Theorem 14 is sufficient to
show the existence of separating hyperplanes with positive margins for the embeddings of two
classes, we will see in Section 3.3 that the separability with a hyperplane passing through
the origin as in Corollary 15 is a desirable property for the extension of these results to a
multi-class setting.

3.3 Separability bounds for multiple classes 

In this section, we study the separability of the embeddings of multiple classes with the 
supervised Laplacian eigenmaps algorithm. In particular, we focus on a setting with multiple 
classes that can be grouped into several categories. The classes in each category are assumed 
to bear a relatively high resemblance within themselves, whereas the resemblance between 
classes from different categories is weaker. This is a scenario that is likely to be encountered 
in several practical data classification problems. 

In the following, we study the embeddings of multiple categorizable classes. The objective
matrix $L_w - \mu L_b$ defining the embedding is close to a block-diagonal matrix if the
between-category similarities are relatively low. Building on this observation, we present
a result that links the separability of the overall embedding to the separability of the
embeddings of each individual category with the same algorithm. Especially in a setting with
many classes, this simplifies the problem for multiple classes and makes it possible to
deduce information for the overall separation by studying the separation of the individual
categories, which is easier to analyze.

We consider data samples $\mathcal{X} = \{x_i\}_{i=1}^N$ belonging to $M$ different classes
that can be categorized into $Q$ groups. For the purpose of our theoretical analysis, let us
focus for a moment on the individual categories and consider the embedding of the samples in
each category $q$ with the supervised Laplacian eigenmaps algorithm if the data graph was
constructed only within the category $q$. Let $Y^q$ be the $d_q$-dimensional embedding of
category $q$. Assume that $Y^q$ is separable with margin $\gamma_c$. Then for any two classes
$k, l$ in category $q$, there exists a hyperplane $\omega_{kl}$ such that

$$\omega_{kl}^T y_i^q \geq \gamma_c/2 \ \text{ if } C_i = k, \qquad \omega_{kl}^T y_i^q \leq -\gamma_c/2 \ \text{ if } C_i = l \qquad (30)$$

where $y_i^q$ is the $i$-th row of $Y^q$ defining the coordinates of the $i$-th data sample in
category $q$. Note that an offset of $b_{kl} = 0$ is assumed here, i.e., the classes in each
category are assumed to be separable with hyperplanes passing through the origin. While this
is mainly for simplifying the analysis, the studied supervised Laplacian eigenmaps algorithm
in fact computes embeddings having this property in practice (the theoretical guarantee for
the two-class setting being provided in Corollary 15).

Now let $L = L_w - \mu L_b$ denote the $N \times N$ objective matrix defining the embedding of
$\mathcal{X}$ with supervised Laplacian eigenmaps. Also, let $L^c = L^c_w - \mu L^c_b$ denote the
block-diagonal objective matrix where the within-class and the between-class Laplacians
$L^c_w$ and $L^c_b$ are obtained by restricting the graph edges to the ones within the
categories. In other words, $L^c$ is obtained by removing the edges between all pairs of data
samples belonging to different categories.

Let $L^{nc} = L - L^c$ denote the component of $L$ arising from the between-category data
connections. In our analysis, we will treat this component $L^{nc}$ as a perturbation on the
block-diagonal matrix $L^c$ and analyze the eigenvectors of $L$ accordingly in order to study
the separability of the embedding obtained with $L$.
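The decomposition of the objective matrix into a block-diagonal part and a between-category perturbation can be sketched on a toy graph. The following Python example is a minimal sketch under stated assumptions: the symmetric normalized Laplacian $I - D^{-1/2} W D^{-1/2}$ is assumed as the Laplacian form, the toy edge weights are hypothetical, and the Frobenius norm is used only as a convenient upper bound on the spectral norm of the perturbation.

```python
import math


def normalized_laplacian(W):
    """Symmetric normalized Laplacian I - D^{-1/2} W D^{-1/2} (assumed form)."""
    n = len(W)
    deg = [sum(row) for row in W]
    return [[(1.0 if i == j else 0.0) - W[i][j] / math.sqrt(deg[i] * deg[j])
             for j in range(n)] for i in range(n)]


def restrict_to_categories(W, category):
    """Remove every edge joining samples that belong to different categories."""
    n = len(W)
    return [[W[i][j] if category[i] == category[j] else 0.0 for j in range(n)]
            for i in range(n)]


def objective_matrix(W_w, W_b, mu):
    """Objective matrix L_w - mu * L_b built from the two weight matrices."""
    L_w, L_b = normalized_laplacian(W_w), normalized_laplacian(W_b)
    n = len(L_w)
    return [[L_w[i][j] - mu * L_b[i][j] for j in range(n)] for i in range(n)]


def frobenius_norm(A):
    """Frobenius norm, an upper bound on the spectral norm of A."""
    return math.sqrt(sum(a * a for row in A for a in row))


# Toy data (hypothetical): 8 samples, 4 classes of 2 samples, 2 categories of 2 classes.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
category = [c // 2 for c in labels]
n = len(labels)
W_w = [[1.0 if i != j and labels[i] == labels[j] else 0.0 for j in range(n)]
       for i in range(n)]
# Between-class edges: weight 1 inside a category, weak weight 0.01 across categories.
W_b = [[0.0 if labels[i] == labels[j]
        else (1.0 if category[i] == category[j] else 0.01)
        for j in range(n)] for i in range(n)]

mu = 0.01
L = objective_matrix(W_w, W_b, mu)
L_c = objective_matrix(restrict_to_categories(W_w, category),
                       restrict_to_categories(W_b, category), mu)
L_nc = [[L[i][j] - L_c[i][j] for j in range(n)] for i in range(n)]
perturbation = frobenius_norm(L_nc)
```

In this toy setting the between-category weights are small, so the perturbation norm is several orders of magnitude below the entries of the block-diagonal part.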

We will need a condition on the separation of the eigenvalues of $L^c$. Let $\eta$ denote the
minimal separation (the smallest difference) between the eigenvalues of $L^c$

$$\eta := \min_{i \neq j} |\lambda_i - \lambda_j| \qquad (31)$$

where $\lambda_i$ are the eigenvalues of $L^c$ for $i = 1, \dots, N$. For $\mu > 0$ and a random
sampling of data, the eigenvalues of $L^c$ are expected to be distinct⁴; therefore, one can
reasonably assume the minimal eigenvalue separation to be positive. The characterization of
the behavior of the minimal separation of the eigenvalues depending on the graph properties
is not within the scope of this study and remains as future work.

We state below our main result about the separability of the embeddings of multiple 
categorizable classes. 


Theorem 16 Let $L = L_w - \mu L_b \in \mathbb{R}^{N \times N}$ be the matrix representing the
objective function of the supervised Laplacian eigenmaps algorithm with $M$ classes
categorizable into $Q$ groups. Assume that $L$ is close to a block-diagonal objective matrix
$L^c$ containing only within-category edges such that the perturbation $L^{nc} = L - L^c$ is
bounded as

$$\| L^{nc} \| < \frac{\eta}{2}$$

in terms of the minimal eigenvalue separation $\eta$ of the matrix $L^c$ defined in (31). Let
each category $q$ have a $d_q$-dimensional embedding $Y^q$ separable with margin $\gamma_c$ as
in (30). We define the parameters

$$\xi = \left( 1 - \frac{4 \| L^{nc} \|^2}{\eta^2} \right)^{1/2}$$

and

$$\zeta = \left( 2 - 2\xi + 2\sqrt{N}\,(1 - \xi^2) \right)^{1/2}.$$

Then there exists an embedding $Y$ of dimension $d = \sum_{q=1}^{Q} d_q$ consisting of the
eigenvectors of the overall objective matrix $L$ that is separable with a margin of at least

$$\gamma = \gamma_c/\sqrt{2} - 2\zeta$$

provided that $\zeta < \gamma_c/(2\sqrt{2})$.

4. Note that the within-class and the between-class Laplacians $L^c_w$ and $L^c_b$ are normalized
Laplacians; therefore, the constant vector is not an eigenvector and 0 is not a repeating
eigenvalue.


The proof of Theorem 16 is given in Appendix B.4. The proof is based on first analyzing
the separation of the embedding corresponding to the block-diagonal component $L^c$ of the
objective matrix, and then lower bounding the separation of the original embedding in
terms of the perturbation and the separation of the eigenvalues. In brief, the theorem says
that if the classes are categorizable with sufficiently low between-category edge weights,
and if the individual embedding of each category makes all classes in that category linearly
separable, then in the embedding computed for the overall data graph with the supervised
Laplacian eigenmaps algorithm, all pairs of classes (from same and different categories)
are also linearly separable. This extends the linear separability of individual categories to
the separability of all classes. The margin of the overall separation decreases at a rate
of $O(\sqrt{1 - \|L^{nc}\|})$ as the magnitude of the non-block-diagonal component
$\|L^{nc}\|$ of the objective matrix increases.⁵
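The quantities entering Theorem 16 can be evaluated with a few lines of code once the eigenvalues of the block-diagonal matrix and the norm of the perturbation are available. The Python sketch below is only illustrative: the function names and the numerical inputs are assumptions, the eigenvalues are taken as given rather than computed, and the parameter $\zeta$ is passed in directly instead of being recomputed.

```python
import math


def min_eigenvalue_separation(eigenvalues):
    """eta in (31): the minimum over all pairs equals the smallest adjacent gap
    once the eigenvalues are sorted."""
    ordered = sorted(eigenvalues)
    return min(b - a for a, b in zip(ordered, ordered[1:]))


def xi_parameter(perturbation_norm, eta):
    """xi = sqrt(1 - 4 ||L_nc||^2 / eta^2), defined when ||L_nc|| < eta / 2."""
    if not perturbation_norm < eta / 2.0:
        raise ValueError("Theorem 16 requires ||L_nc|| < eta / 2")
    return math.sqrt(1.0 - 4.0 * perturbation_norm ** 2 / eta ** 2)


def overall_margin(gamma_c, zeta):
    """Margin gamma_c / sqrt(2) - 2 * zeta, valid when zeta < gamma_c / (2 sqrt(2))."""
    if not zeta < gamma_c / (2.0 * math.sqrt(2.0)):
        raise ValueError("Theorem 16 requires zeta < gamma_c / (2 * sqrt(2))")
    return gamma_c / math.sqrt(2.0) - 2.0 * zeta


# Hypothetical inputs, chosen only for illustration.
eta = min_eigenvalue_separation([0.10, 0.45, 0.30, 0.90])
xi = xi_parameter(0.03, eta)
margin = overall_margin(1.0, 0.1)
```

The checks inside the functions mirror the two conditions of the theorem, so an inadmissible perturbation norm or value of $\zeta$ raises an error instead of returning a meaningless margin.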

The dimension of the separable embedding is given by the sum of the dimensions of
the individual embeddings of the categories that ensure the linear separability within each
category. Hence, the dimension required for linear separability grows linearly with the
number of categories. In order to compute the exact number of dimensions required for linear
separability, one needs to know the number of dimensions that ensures the separability
within each category. Nevertheless, the provided result is still interesting, as the
theoretical or numerical analysis of individual categories is often easier than the analysis
of the whole data set, since the number of classes in a particular category is more limited.⁶
Note that one can also interpret the theorem by considering each class as a different
category. However, in this case the edges between samples of different classes must have
sufficiently low weights for the theorem to be applicable, i.e., the non-block-diagonal
component $L^{nc}$ of the Laplacian must be sufficiently small. The examination of the general
problem of embedding data with multiple non-categorizable classes and no assumptions on the
edge weights between different classes seems to be more challenging and remains as a future
direction of study.


5. From the definition of the parameters $\xi$, $\zeta$ and $\gamma$ in Theorem 16, we have
$\xi = O(\sqrt{1 - \|L^{nc}\|^2})$, $\zeta = O(\sqrt{1 - \xi})$, $\gamma = O(1 - \zeta)$. It follows
that $\gamma = O(1 - (1 - \sqrt{1 - \|L^{nc}\|^2})^{1/2}) \approx O(\sqrt{1 - \|L^{nc}\|})$.

6. For instance, we have theoretically shown in Section 3.2 that one dimension is sufficient
for obtaining a linearly separable embedding of two classes. While we do not provide a
theoretical analysis for more than two classes, we have experimentally observed that data
becomes linearly separable at two dimensions when the number of classes is three or four.

Figure 2: Data sampled from two-dimensional synthetic surfaces: (a) quadratic surfaces,
(b) Swiss rolls, (c) spheres. Red and blue colors represent two different classes.

Figure 3: Supervised Laplacian embeddings of data sampled from quadratic surfaces.


4. Experimental Results 

In this section, we present results on synthetic and real data sets. We compare several
supervised manifold learning methods and study their performances in relation to our
theoretical results.

4.1 Separability of embeddings with supervised manifold learning 

We first present results on synthetic data in order to study the embeddings obtained with
supervised dimensionality reduction. We test the supervised Laplacian eigenmaps algorithm
in a setting with two classes. We generate samples from two nonintersecting and linearly
nonseparable surfaces in $\mathbb{R}^3$ that represent two different classes. We experiment on
three different types of surfaces; namely, quadratic surfaces, Swiss rolls and spheres. The
data sampled from these surfaces are shown in Figure 2. We choose $N = 200$ samples from each
class. We construct the graph $G_w$ by connecting each sample to its $K$-nearest neighbors
from the same class, where $K$ is chosen between 20 and 30. The graph $G_b$ is constructed
similarly, where each sample is connected to its $K/5$ nearest neighbors from the other class.
The graph weights are determined as a Gaussian function of the distance between the
samples. The embeddings are then computed by minimizing the objective function in (28).

Figure 4: Variation of the separation $\gamma$ between the two classes with the parameter
$\mu$ for the synthetic data sets. (a) Experimental value of the separation $\gamma$.
(b) Theoretical upper bound for $\mu$ that guarantees a separation of at least $\gamma$.
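The construction of the within-class and between-class graphs used in this experiment can be sketched as follows. This Python example is a minimal illustration and not the code used for the experiments: the Gaussian weight $\exp(-\|x_i - x_j\|^2/\sigma^2)$, the symmetrization of the neighborhood relation, the function name `knn_graph` and the toy points are all assumptions.

```python
import math


def knn_graph(points, labels, k, same_class, sigma=1.0):
    """Symmetric weight matrix linking each sample to its k nearest neighbors
    among same-class samples (same_class=True) or other-class samples (False)."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        candidates = [j for j in range(n)
                      if j != i and (labels[j] == labels[i]) == same_class]
        candidates.sort(key=lambda j: math.dist(points[i], points[j]))
        for j in candidates[:k]:
            # Gaussian function of the distance (assumed form of the weight).
            w = math.exp(-math.dist(points[i], points[j]) ** 2 / sigma ** 2)
            W[i][j] = W[j][i] = w  # keep the graph undirected
    return W


# Toy data in R^3 (hypothetical): class 0 on the plane z = 0, class 1 on z = 1.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 0, 1), (0, 1, 1)]
labels = [0, 0, 0, 1, 1, 1]

W_w = knn_graph(points, labels, k=2, same_class=True)   # within-class graph G_w
W_b = knn_graph(points, labels, k=1, same_class=False)  # between-class graph G_b
```

In the experiments the neighborhood sizes are much larger ($K$ between 20 and 30 for $G_w$ and $K/5$ for $G_b$); the small values above are only meant to keep the toy example readable.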


The one-dimensional, two-dimensional, and three-dimensional embeddings obtained for the
quadratic surface are shown in Figure 3, where the weight parameter is taken as $\mu = 0.57$
(to have a visually clear embedding for the purpose of illustration). Similar results are
obtained on the Swiss roll and the spherical surface. One can observe that the data samples
that were initially linearly nonseparable become linearly separable when embedded with
the supervised Laplacian eigenmaps algorithm. The two classes are mapped to different
(positive or negative) sides of the real line in Figure 3(a), as predicted by Lemma 13. The
separation in the 2-D and 3-D embeddings in Figure 3 is close to the separation obtained
with the 1-D embedding.


We then compute and plot the separation obtained at different values of $\mu$. Figure 4(a)
shows the experimental value of the separation $\gamma = z_{2,\min} - z_{1,\max}$ obtained with
the 1-D embedding for the three types of surfaces. Figure 4(b) shows the theoretical upper
bound for $\mu$ in Theorem 14 that guarantees a separation of at least $\gamma$. Both the
experimental value and the theoretical bound for the separation $\gamma$ decrease with the
increase in the parameter $\mu$. This is in agreement with (29), which predicts a decrease of
$O(1 - \sqrt{\mu})$ in the separation with respect to $\mu$. The theoretical bound for the
separation is seen to decrease at a relatively faster rate with $\mu$ for the Swiss roll data
set. This is due to the particular structure of this data set with a nonuniform sampling
density where the sampling is sparser away from the spiral center. The parameter $w_{\min}$
then takes a small value, which consequently leads to a fast rate of decrease for the
separation due to (29). Comparing Figures 4(a) and 4(b), one observes that the theoretical
bounds for the separation are numerically more pessimistic than their experimental values,
which is a result of the fact that our results are obtained with a worst-case analysis.
Nevertheless, the theoretical bounds capture well the actual variation of the separation
margin with $\mu$.

Figure 5: Comparison of the performance of several supervised classification methods.
Panels: (a) COIL-20 object data set, (b) ETH-80 object data set.

4.2 Classification performance of supervised manifold learning algorithms 


We now study the overall performance of classification obtained in a setting with supervised
manifold learning, where the out-of-sample generalization is achieved with smooth RBF
interpolators. We evaluate the theoretical results of Section 2 on three real data sets: the
COIL-20 object database (Nene et al., 1996), the Yale face database (Georghiades et al.,
2001), and the ETH-80 object database (Leibe and Schiele, 2003). The COIL-20, Yale face,
and ETH-80 databases contain a total of 1420, 2204 and 3280 images from 20, 38 and 8
image classes respectively. The images in these three data sets are converted to greyscale,
normalized, and downsampled to a resolution of respectively 32 x 32, 20 x 17, and 20 x 20
pixels.


4.2.1 Comparison of supervised manifold learning to baseline classifiers 

We first compare the performance of supervised manifold learning with some reference
classification methods. The performances of SVM, K-NN, kernel regression, and the supervised
Laplacian eigenmaps method are evaluated and compared. Figure 5 reports the results
obtained on the COIL-20 data set, the ETH-80 data set, the Yale data set, and a subset
of the Yale data set consisting of its first 10 classes (reduced Yale data set). The SVM,
K-NN, and kernel regression algorithms are applied in the original domain. In the supervised
Laplacian eigenmaps method, the embedding of the training images into a low-dimensional
space is computed. Then, an out-of-sample interpolator with Gaussian RBFs is constructed
that maps the training samples to their embedded coordinates as described in Section 2.3.
Test samples are mapped to the low-dimensional domain via the RBF interpolator and
the class labels of test samples are estimated via nearest-neighbor classification in the
low-dimensional domain. The plots in Figure 5 show the misclassification rate of the test
samples (in percent) as a function of the ratio of the number of training samples to the
size of the whole data set. The results are the average of 5 repetitions of the experiment
with different random choices for the training and test samples.
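The out-of-sample classification pipeline described above (RBF interpolation followed by nearest-neighbor classification in the embedding domain) can be sketched in a few lines. The Python example below is a hedged illustration on toy data: the interpolator form $f(x) = \sum_i c_i\, \phi(\|x - x_i\|)$ with $\phi(r) = e^{-r^2/\sigma^2}$, the small linear solver, all function names and the toy embedding coordinates are assumptions, and it is not the implementation used in the experiments.

```python
import math


def gaussian_rbf(r, sigma):
    """Gaussian kernel phi(r) = exp(-r^2 / sigma^2) (assumed form)."""
    return math.exp(-(r * r) / (sigma * sigma))


def solve_linear_system(A, B):
    """Solve A X = B by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, len(M[r])):
                M[r][c] -= factor * M[col][c]
    X = [[0.0] * len(B[0]) for _ in range(n)]
    for i in reversed(range(n)):
        for k in range(len(B[0])):
            s = M[i][n + k] - sum(M[i][j] * X[j][k] for j in range(i + 1, n))
            X[i][k] = s / M[i][i]
    return X


def fit_rbf_interpolator(train_x, train_y, sigma):
    """Return f with f(train_x[i]) = train_y[i], built from Gaussian RBFs."""
    Phi = [[gaussian_rbf(math.dist(a, b), sigma) for b in train_x] for a in train_x]
    coeffs = solve_linear_system(Phi, train_y)

    def f(x):
        weights = [gaussian_rbf(math.dist(x, xi), sigma) for xi in train_x]
        return [sum(w * c[k] for w, c in zip(weights, coeffs))
                for k in range(len(train_y[0]))]

    return f


def nearest_neighbor_label(y, train_y, train_labels):
    """Label of the training embedding closest to y."""
    best = min(range(len(train_y)), key=lambda i: math.dist(y, train_y[i]))
    return train_labels[best]


# Toy training set (hypothetical) with a one-dimensional embedding.
train_x = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
train_labels = [0, 0, 1, 1]
train_y = [[-1.0], [-0.8], [0.9], [1.0]]

f = fit_rbf_interpolator(train_x, train_y, sigma=1.0)
test_x = [(0.2, 0.5), (2.9, 0.4)]
predicted = [nearest_neighbor_label(f(x), train_y, train_labels) for x in test_x]
```

The interpolator reproduces the training embedding exactly at the training samples, and each test sample receives the label of the closest training coordinate in the low-dimensional domain.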

The results in Figure 5 show that the best results are obtained with the supervised
Laplacian eigenmaps algorithm in the COIL-20 and ETH-80 data sets. It is interesting to
compare Figures 5(c) and 5(d). The SVM classifier in the original domain performs nearly
the same as supervised Laplacian eigenmaps in the reduced version of the Yale database
with 10 classes; however, in the full data set with 38 classes supervised Laplacian eigenmaps
outperforms SVM and gives the most accurate results. This can be explained with the fact
that the linear separability of different classes degrades as the number of classes increases,
thus causing the performance of SVM to decrease, as well as that of K-NN and kernel
regression classifiers. Meanwhile, the performance of the supervised Laplacian eigenmaps
method is not much affected by the increase in the number of classes. The K-NN and kernel
regression classifiers are seen to give almost the same performance in the plots in Figure 5.
The number of neighbors is set as $K = 1$ for the K-NN algorithm in these experiments,
where it has been observed to attain its best performance; and the scale parameter of the
kernel regression algorithm is optimized to get the best accuracy, which has turned out
to take relatively small values. Hence the performances of these two classifiers practically
correspond to that of the nearest-neighbor classifier in the original domain.


4.2.2 Variation of the error with algorithm parameters and sample size 

We first study the evolution of the classification error with the number of training samples.
Figures 6(a)-6(c) show the variation of the misclassification rate of test samples with respect
to the total number of training samples $N$ for the COIL-20, ETH-80 and Yale data sets.
Each curve in the figure shows the errors obtained at a different value of the dimension $d$
of the embedding. The decrease in the misclassification rate with the number of training
samples is in agreement with the results in Section 2, as expected.

The results of Figure 6 are replotted in Figure 7, where the variation of the
misclassification rate is shown with respect to the dimension $d$ of the embedding at
different $N$ values. It is observed that there may exist an optimal value of the dimension
that minimizes the misclassification rate. This can be interpreted in light of the conditions
(14) and (16) in Theorems 8 and 9, which impose a lower bound on the separability margin
$\gamma_Q$ in terms of the dimension $d$ of the embedding. In the supervised Laplacian eigenmaps
algorithm, the first few dimensions are critical and effective for separating different
classes. The decrease in the error with the increase in the dimension for small values of $d$
can be explained with the fact that the separation increases with $d$ at small $d$, thereby
satisfying the conditions (14), (16). Meanwhile, the error may stagnate or increase if the
dimension $d$ increases beyond a certain value, as the separation does not necessarily
increase at the same rate.

Figure 6: Variation of the misclassification rate with the number of training samples.
Panels: (a) COIL-20 object data set, (b) ETH-80 object data set, (c) Yale face data set.

We then examine the variation of the misclassification rate with the separation. We
obtain embeddings at different separation values $\gamma$ by changing the parameter $\mu$ of
the supervised Laplacian eigenmaps algorithm. Figure 8 shows the variation of the
misclassification rate with the separation $\gamma$. Each curve is obtained at a different
value of the scale parameter $\sigma$ of the RBF kernels. It is seen that the misclassification
rate decreases in general with the separation for small $\gamma$ values. This is in agreement
with our results, as the conditions (14), (16) require the separation to be higher than a
threshold. On the other hand, the possible increase in the error at relatively large values
of the separation is due to the following. These parts of the plots are obtained at very small
$\mu$ values, which typically result in a deformed embedding with a degenerate geometry. The
deformation of structure at too small values of $\mu$ may cause the interpolation function to
be irregular and hence result in an increase in the error. The tradeoff between the
separation and the interpolation function regularity is further studied in Section 4.2.3.

Figure 7: Variation of the misclassification rate with the dimension of the embedding.
Panels: (a) COIL-20 object data set, (b) ETH-80 object data set, (c) Yale face data set.

Figure 8: Variation of the misclassification rate with the separation.
Panels: (a) COIL-20 object data set, (b) ETH-80 object data set, (c) Yale face data set.


Finally, Figure 9 shows the relation between the misclassification error and the scale
parameter $\sigma$ of the Gaussian RBF kernels. Each curve is obtained at a different value of
the $\mu$ parameter. The optimum value of the scale parameter minimizing the misclassification
error can be observed in most experiments. These results confirm the findings of Section
2.4, suggesting that there exists a unique value of $\sigma$ that minimizes the left hand side
of the conditions (14), (16), which probabilistically guarantee the correct classification of
data.


4.2.3 Performance analysis of several supervised manifold learning algorithms

Next, we compare several supervised manifold learning methods. We aim to interpret the
performance differences of different types of embeddings in light of our theoretical results
in Section 2.3. First, remember from Theorem 8 that the condition

$$\sqrt{d}\, C\, (L_\phi \delta + \epsilon) < \gamma/2 \qquad (32)$$

needs to be satisfied (or, equivalently, the condition (16) from Theorem 9) in order for the
generalization bounds to hold. This preliminary condition basically states that a compromise
must be achieved between the regularity of the interpolation function, captured via
the terms $C$ and $L_\phi$, and the separation $\gamma$ of the embedding of training samples, in
order to bound the misclassification error. In other words, increasing the separation too
much in the embedding of training samples does not necessarily lead to good classification
performance if the interpolation function has poor regularity.

Figure 9: Variation of the misclassification rate with the scale parameter.
Panels: (a) COIL-20 object data set, (b) ETH-80 object data set, (c) Yale face data set.

Hence, when comparing different embeddings in the experiments of this section, we
define a condition parameter given by

$$\frac{\sqrt{d}\, C L_\phi}{\gamma}$$

which represents the ratio of the left and right hand sides of (32) (by fixing the probability
parameters $\delta$ and $\epsilon$). Setting the Lipschitz constant of the Gaussian RBF kernel as
$L_\phi = \sqrt{2}\, e^{-1/2} \sigma^{-1}$ (see Section 2.4 for details), we can equivalently define
the condition parameter as

$$\kappa = \frac{\sqrt{d}\, C}{\sigma \gamma} \qquad (33)$$

and study this condition parameter for the supervised dimensionality reduction methods in
comparison. Note that a smaller condition parameter means that the necessary conditions of
Theorems 8 and 9 are more likely to be satisfied, hence hinting at the expectation of a
better classification accuracy.
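The condition parameter in (33) and the Lipschitz constant it is derived from are simple to evaluate. The Python sketch below is a hedged illustration: it assumes the kernel form $\phi(r) = e^{-r^2/\sigma^2}$, for which the largest slope is $\sqrt{2}\, e^{-1/2} \sigma^{-1}$, and the function names and numerical values are hypothetical.

```python
import math


def gaussian_rbf_lipschitz(sigma):
    """L_phi = sqrt(2) * exp(-1/2) / sigma for phi(r) = exp(-r^2 / sigma^2)."""
    return math.sqrt(2.0) * math.exp(-0.5) / sigma


def condition_parameter(d, C, sigma, gamma):
    """kappa = sqrt(d) * C / (sigma * gamma), as in (33)."""
    return math.sqrt(d) * C / (sigma * gamma)


# Numerical check of the Lipschitz constant: largest finite-difference slope
# of phi on a fine grid over [0, 3).
sigma = 0.7
step = 1e-4
max_slope = max(
    abs(math.exp(-((r + step) ** 2) / sigma ** 2) - math.exp(-(r ** 2) / sigma ** 2)) / step
    for r in (k * step for k in range(30000))
)

# Hypothetical embedding: dimension 4, coefficient bound 3, scale 0.5, margin 2.
kappa = condition_parameter(d=4, C=3.0, sigma=0.5, gamma=2.0)
```

Since $\kappa$ is inversely proportional to both $\sigma$ and $\gamma$, a small scale parameter or a small margin inflates the condition parameter, unless the coefficient bound $C$ decreases accordingly.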

We compare the following supervised embeddings:

• Supervised Laplacian eigenmaps embedding obtained by solving (28):

$$\min_Y \; \mathrm{tr}(Y^T L_w Y) - \mu\, \mathrm{tr}(Y^T L_b Y) \quad \text{subject to } Y^T Y = I$$

• Fisher embedding⁷ obtained by solving

$$\max_Y \; \frac{\mathrm{tr}(Y^T L_b Y)}{\mathrm{tr}(Y^T L_w Y)} \qquad (34)$$

• Label encoding, which maps each data sample to its label vector of the form [0 0 ... 1 ... 0],
where the only nonzero entry corresponds to its class.
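As a small illustration of the last item, label encoding can be written in a few lines. The Python sketch below assumes class labels indexed from 0; the function name and the toy labels are hypothetical.

```python
import math


def label_encoding(labels, num_classes):
    """Map each sample to the indicator vector of its class (labels start at 0)."""
    return [[1.0 if k == c else 0.0 for k in range(num_classes)] for c in labels]


# Toy labels (hypothetical): four samples from three classes.
Y = label_encoding([0, 2, 1, 2], num_classes=3)

# Samples from different classes are at distance sqrt(2); samples from the
# same class are mapped to the same point.
dist_different = math.dist(Y[0], Y[1])
dist_same = math.dist(Y[1], Y[3])
```

All samples of a class collapse onto a single point in this embedding, which is why it can be seen as a degenerate case that ignores the geometry of the data within each class.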

The label encoding method is included in the experiments as a reference; it can also be
regarded as a degenerate supervised manifold learning algorithm that provides maximal
separation between data samples from different classes. In all of the above methods, the
training samples are embedded into the low-dimensional domain, and test samples are mapped
via Gaussian RBF interpolators and assigned labels via nearest-neighbor classification in the
low-dimensional domain. The scale parameter $\sigma$ of the RBF kernel is set to a reference
value in each data set within the typical range [0.5, 1] where the best accuracy is attained.
We have fixed the weight parameter as $\mu = 0.01$ in all setups, and set the dimension of the
embedding equal to the number of classes. In order to study the properties of the
interpolation function in relation to the condition parameter in (33), we also test the
supervised Laplacian eigenmaps and the label encoding methods under RBF interpolators with
high scale parameters, which are chosen as a few times the reference $\sigma$ value giving the
best results. Finally, we also include in the comparisons a regularized version of the
supervised Laplacian eigenmaps embedding obtained by controlling the magnitude of the
interpolation function.

The results obtained on the COIL-20, ETH-80, Yale and reduced Yale data sets are
reported respectively in Figures 10-13. In each figure, panel (a) shows the misclassification
rates of the embeddings and panel (b) shows the condition parameters of the embeddings at
different total numbers of training samples ($N$). The logarithm of the condition parameter
is plotted for ease of visualization.


The plots in Figures 10-13 show that the label encoding, supervised Laplacian eigenmaps,
and the regularized supervised Laplacian eigenmaps embeddings yield better classification
accuracy than the other three methods (supervised Fisher, and the embeddings with
high scale parameters) in all experiments, with the only exception of the cases $N = 60$ and
$N = 100$ for the reduced Yale data set. Meanwhile, examining the condition parameters of
the embeddings, we observe that label encoding, supervised Laplacian eigenmaps, and the
regularized supervised Laplacian eigenmaps embeddings always have a smaller condition
parameter than the other three methods. This observation confirms the intuition provided
by the necessary conditions of Theorems 8 and 9: a compromise between the separation and
the interpolator regularity is required for good classification accuracy. The increase in the
condition parameter as $N$ increases is due to the fact that the coefficient bound $C$ involves
a summation over all training samples. The reason why the embeddings with high $\sigma$
parameters yield better classification accuracy than the other ones in the cases $N = 60$ and
$N = 100$ for the reduced Yale data set is that a larger RBF scale helps better cover the
ambient space when the number of training samples is particularly low.

7. We use a nonlinear version of the formulation in (Wang and Chen, 2009) by removing the
constraint that the embedding be given by a linear projection of the data.

Figure 10: Misclassification rates and the condition parameters of the embeddings for the
COIL-20 object data set

Figure 11: Misclassification rates and the condition parameters of the embeddings for the
ETH-80 object data set

Figure 12: Misclassification rates and the condition parameters of the embeddings for the
Yale face data set

Figure 13: Misclassification rates and the condition parameters of the embeddings for the
reduced Yale face data set

In the COIL-20 and the reduced Yale data sets, the best classification accuracy is
obtained with the regularized supervised Laplacian eigenmaps method, while this is also the
method having the smallest condition parameter, except for the smallest two values of $N$ in
the reduced Yale data set. In the ETH-80 and full Yale data sets, the classification
accuracy of label encoding attains that of the supervised Laplacian eigenmaps method. The
condition parameter of the label encoding embedding is relatively small in these two data
sets; in fact, in ETH-80 the label encoding embedding has the smallest condition parameter
among all methods. This may be useful for explaining why this simple classification method
has quite favorable performance in this data set. Likewise, if we leave aside the versions of
the methods with high-scale interpolators, the Fisher embedding has the highest
misclassification rate compared to label encoding, the supervised Laplacian, and the
regularized supervised Laplacian embeddings, while it also has the highest condition
parameter among these methods.

To conclude, the results in this section suggest that the experimental findings are in
agreement with the main results in Section 2.3, justifying the pertinence of the conditions
(14) and (16) to classification accuracy, and hence suggesting that a balance must be sought
between the separability margin of the embedding and the regularity of the interpolation
function in supervised manifold learning.


5. Conclusions 

Most current supervised manifold learning algorithms focus on learning representations
of training data, while the generalization properties of these representations are not
yet well understood. In this work, we have proposed a theoretical analysis of
the performance of supervised manifold learning methods. We have presented generalization
bounds for nonlinear supervised manifold learning algorithms and explored how the
classification accuracy relates to several setup parameters such as the linear separation
margin of the embedding, the regularity of the interpolation function, the number of
training samples, and the intrinsic dimensions of the class supports (manifolds). Our results
suggest that embeddings of training data with good generalization capacities must allow
the construction of sufficiently regular interpolation functions that extend the mapping to
new data. We have then examined whether the assumption of linear separability is easy
to satisfy for structure-preserving supervised embedding algorithms. We have taken the
supervised Laplacian eigenmaps algorithms as reference and shown that these methods
can yield linearly separable embeddings. By providing insight into the generalization
capabilities of supervised dimensionality reduction algorithms, our findings can be helpful
in the classification of low-dimensional data sets.


Appendix A. Proof of the results in Section 2

A.1 Proof of Theorem 2

Proof. Given $x$, let $x_i \in \mathcal{X}$ be the nearest neighbor of $x$ in $\mathcal{X}$ that is sampled from $\nu_m$:

\[ i = \arg\min_j \|x - x_j\| \quad \text{s.t. } x_j \sim \nu_m . \]

Due to the separation hypothesis,

\[ \omega_{mk}^T y_i + b_{mk} > \gamma/2, \quad \forall k = 1, \dots, M-1 . \]

We have

\[
\begin{aligned}
\omega_{mk}^T f(x) + b_{mk} &= \omega_{mk}^T f(x_i) + b_{mk} + \omega_{mk}^T \big( f(x) - f(x_i) \big) \\
&\geq \omega_{mk}^T y_i + b_{mk} - \big| \omega_{mk}^T \big( f(x) - f(x_i) \big) \big| \\
&> \gamma/2 - \| f(x) - f(x_i) \| \geq \gamma/2 - L \| x - x_i \| .
\end{aligned}
\]

Then if the condition $L \|x - x_i\| \leq \gamma/2$ is satisfied, from the above inequality we have
$\omega_{mk}^T f(x) + b_{mk} > 0$ for all $k = 1, \dots, M-1$. This gives $C(x) = m$ and thus ensures that $x$
is classified correctly.

In the sequel, we lower bound the probability that the distance $\|x - x_i\|$ between $x$ and
its nearest neighbor from the same class is smaller than $\gamma/(2L)$. We employ the following result
by Kulkarni and Posner (1995). It is demonstrated in the proof of Theorem 1 in (Kulkarni
and Posner, 1995) that, if $\mathcal{X}$ contains at least $N_m$ samples drawn i.i.d. from $\nu_m$ such that
$N_m > \mathcal{N}(\epsilon/2, \mathcal{M}_m)$ for some $\epsilon > 0$, then the probability of $\|x - x_i\|$ being larger than $\epsilon$ can
be upper bounded in terms of the covering number of $\mathcal{M}_m$ as

\[ P\big( \|x - x_i\| > \epsilon \big) \leq \frac{\mathcal{N}(\epsilon/2, \mathcal{M}_m)}{2 N_m} . \]

Therefore, for any $\epsilon$ such that $\epsilon \leq \gamma/(2L)$ and $N_m > \mathcal{N}(\epsilon/2, \mathcal{M}_m)$, with probability at least
$1 - \mathcal{N}(\epsilon/2, \mathcal{M}_m)/(2 N_m)$, we have

\[ \|x - x_i\| \leq \epsilon \leq \gamma/(2L) ; \]

thus, the class label of $x$ is correctly estimated as $C(x) = m$ due to the above discussion. ■
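
The probability bound used at the end of this proof is easy to evaluate numerically. The following is a minimal sketch, not part of the original analysis: the covering number is treated as a given input (in practice it is unknown and depends on the manifold), and the function names are hypothetical.

```python
import math


def nn_failure_bound(covering_number: float, n_m: int) -> float:
    """Upper bound N(eps/2, M_m) / (2 N_m) on P(||x - x_i|| > eps).

    `covering_number` stands for N(eps/2, M_m) and is an assumed input.
    The bound is only stated for N_m > N(eps/2, M_m).
    """
    if n_m <= covering_number:
        raise ValueError("the bound requires N_m > covering number")
    return covering_number / (2.0 * n_m)


def samples_for_target(covering_number: float, target: float) -> int:
    """Smallest N_m allowed by the bound with failure probability <= target."""
    n_from_bound = math.ceil(covering_number / (2.0 * target))
    n_from_hypothesis = math.floor(covering_number) + 1  # enforce N_m > N(eps/2, M_m)
    return max(n_from_bound, n_from_hypothesis)
```

For a fixed covering number, the failure probability decays as $1/N_m$, so halving the target roughly doubles the number of samples required from class $m$.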


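A.2 Proof of Lemma 3

Proof. We first bound the deviation of $f(x)$ from the sample average of $f$ in the neighborhood of $x$ as

\[
\Big\| f(x) - \frac{1}{Q} \sum_{x_j \in A} f(x_j) \Big\| \leq \| f(x) - m_f \| + \Big\| \frac{1}{Q} \sum_{x_j \in A} f(x_j) - m_f \Big\| \tag{35}
\]

where $m_f$ is the conditional expectation of $f(u)$, given $u \in B_\delta(x)$:

\[
m_f = E_u\big[ f(u) \mid u \in B_\delta(x) \big] = \frac{1}{\nu_m(B_\delta(x))} \int_{B_\delta(x)} f(u) \, d\nu_m(u) .
\]

The first term in (35) can be bounded as

\[
\begin{aligned}
\| f(x) - m_f \| &= \Big\| \frac{1}{\nu_m(B_\delta(x))} \int_{B_\delta(x)} \big( f(x) - f(u) \big) \, d\nu_m(u) \Big\| \\
&\leq \frac{1}{\nu_m(B_\delta(x))} \int_{B_\delta(x)} \| f(x) - f(u) \| \, d\nu_m(u)
\leq \frac{1}{\nu_m(B_\delta(x))} \int_{B_\delta(x)} L \| x - u \| \, d\nu_m(u) \\
&\leq \frac{1}{\nu_m(B_\delta(x))} \int_{B_\delta(x)} L \delta \, d\nu_m(u) = L \delta
\end{aligned} \tag{36}
\]

where the second inequality follows from the fact that $f$ is Lipschitz continuous on the
support $\mathcal{M}_m$, where the measure $\nu_m$ is nonzero.

The second term in (35) is given by

\[
\Big\| \frac{1}{Q} \sum_{x_j \in A} f(x_j) - m_f \Big\| = \bigg( \sum_{k=1}^{d} \Big| \frac{1}{Q} \sum_{x_j \in A} f^k(x_j) - m_f^k \Big|^2 \bigg)^{1/2} \tag{37}
\]

where $m_f^k$ denotes the $k$-th component of $m_f$, for $k = 1, \dots, d$. Consider the random
variables $f^k(x_j)$. Defining

\[
f^k_{\min} = \inf_{u \in B_\delta(x)} f^k(u), \qquad f^k_{\max} = \sup_{u \in B_\delta(x)} f^k(u),
\]

it follows that $f^k_{\max} - f^k_{\min} \leq 2 L \delta$ due to the Lipschitz continuity of $f$. Then from Hoeffding's
inequality, we have

\[
P\bigg( \Big| \frac{1}{Q} \sum_{x_j \in A} f^k(x_j) - m_f^k \Big| \geq \epsilon \bigg)
\leq 2 \exp\bigg( - \frac{2 Q \epsilon^2}{(f^k_{\max} - f^k_{\min})^2} \bigg)
\leq 2 \exp\bigg( - \frac{Q \epsilon^2}{2 L^2 \delta^2} \bigg) .
\]

From the union bound, we get that with probability at least $1 - 2d \exp\big( - \frac{Q \epsilon^2}{2 L^2 \delta^2} \big)$, for all $k$

\[
\Big| \frac{1}{Q} \sum_{x_j \in A} f^k(x_j) - m_f^k \Big| \leq \epsilon,
\]

which yields from (37)

\[
\Big\| \frac{1}{Q} \sum_{x_j \in A} f(x_j) - m_f \Big\| \leq \sqrt{d} \, \epsilon .
\]

Combining this result with the bound in (36), we conclude that with probability at least
$1 - 2d \exp\big( - \frac{Q \epsilon^2}{2 L^2 \delta^2} \big)$,

\[
\Big\| f(x) - \frac{1}{Q} \sum_{x_j \in A} f(x_j) \Big\| \leq L \delta + \sqrt{d} \, \epsilon .
\]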
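
To get a feel for the trade-off in this lemma, the deviation $L\delta + \sqrt{d}\,\epsilon$ and the associated confidence level can be evaluated directly. The sketch below only restates the two expressions derived above; the function names and the failure level `alpha` are illustrative assumptions, not notation from the paper.

```python
import math


def lemma3_deviation(d: int, eps: float, lipschitz: float, delta: float) -> float:
    """Deviation bound L*delta + sqrt(d)*eps on ||f(x) - mean of f over A||."""
    return lipschitz * delta + math.sqrt(d) * eps


def lemma3_confidence(d: int, q: int, eps: float, lipschitz: float, delta: float) -> float:
    """Probability lower bound 1 - 2d exp(-Q eps^2 / (2 L^2 delta^2)).

    The value can be negative, in which case the bound is vacuous.
    """
    return 1.0 - 2.0 * d * math.exp(-q * eps**2 / (2.0 * lipschitz**2 * delta**2))


def lemma3_min_neighbors(d: int, eps: float, lipschitz: float, delta: float, alpha: float) -> int:
    """Smallest Q for which the confidence is at least 1 - alpha (alpha is hypothetical)."""
    return math.ceil(2.0 * lipschitz**2 * delta**2 / eps**2 * math.log(2.0 * d / alpha))
```

The required number of neighbors grows with $(L\delta/\epsilon)^2$ but only logarithmically with the embedding dimension $d$.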


A.3 Proof of Theorem 5

Proof. Given the test sample $x$ and a training sample $x_i$ drawn i.i.d. with respect to $\nu_m$,
the probability that $x_i$ lies within a $\delta$-neighborhood of $x$ is given by

\[ P\big( x_i \in B_\delta(x) \big) = \nu_m\big( B_\delta(x) \big) \geq \eta_{m,\delta} . \]

Then, among the $N_m$ samples drawn with respect to $\nu_m$, the probability that $B_\delta(x)$ contains
at least $Q$ samples is given by

\[
\begin{aligned}
P( |A| \geq Q ) &= \sum_{q=Q}^{N_m} \binom{N_m}{q} \big( \nu_m(B_\delta(x)) \big)^q \big( 1 - \nu_m(B_\delta(x)) \big)^{N_m - q} \\
&\geq \sum_{q=Q}^{N_m} \binom{N_m}{q} ( \eta_{m,\delta} )^q ( 1 - \eta_{m,\delta} )^{N_m - q}
\end{aligned}
\]

where the set $A$ is defined as in (5). The last expression above is the probability of having
at least $Q$ successes out of $N_m$ realizations of a Bernoulli random variable with probability
parameter $\eta_{m,\delta}$. This probability can be lower bounded using a tail bound for binomial
distributions. We thus have

\[ P( |A| \geq Q ) \geq 1 - \exp\bigg( \frac{-2 ( N_m \eta_{m,\delta} - Q )^2}{N_m} \bigg) \]

which simply follows from interpreting $|A|$ as the sum of $N_m$ i.i.d. observations of a
Bernoulli distributed random variable and then applying Hoeffding's inequality as shown
by Herbrich (1999), under the hypothesis that $N_m > Q / \eta_{m,\delta}$.

Assuming that $B_\delta(x)$ contains at least $Q$ samples, Lemma 3 states that with probability
at least

\[ 1 - 2d \exp\bigg( - \frac{|A| \epsilon^2}{2 L^2 \delta^2} \bigg) \geq 1 - 2d \exp\bigg( - \frac{Q \epsilon^2}{2 L^2 \delta^2} \bigg) \]

the deviation between $f(x)$ and the sample average of its neighbors is bounded as

\[ \Big\| f(x) - \frac{1}{|A|} \sum_{x_j \in A} f(x_j) \Big\| \leq L \delta + \sqrt{d} \, \epsilon . \]

Hence, with probability at least

\[
\bigg( 1 - \exp\bigg( \frac{-2 ( N_m \eta_{m,\delta} - Q )^2}{N_m} \bigg) \bigg) \bigg( 1 - 2d \exp\bigg( - \frac{Q \epsilon^2}{2 L^2 \delta^2} \bigg) \bigg)
\geq 1 - \exp\bigg( \frac{-2 ( N_m \eta_{m,\delta} - Q )^2}{N_m} \bigg) - 2d \exp\bigg( - \frac{Q \epsilon^2}{2 L^2 \delta^2} \bigg)
\]

we have

\[ \Big\| f(x) - \frac{1}{|A|} \sum_{x_j \in A} f(x_j) \Big\| \leq L \delta + \sqrt{d} \, \epsilon . \tag{38} \]

The class label of a test sample $x$ drawn from $\nu_m$ is correctly estimated with
the linear classifier if

\[ \omega_{mk}^T f(x) + b_{mk} > 0, \quad \forall k = 1, \dots, M-1, \ k \neq m . \]

If the condition in (38) is satisfied, for all $k \neq m$, we have

\[
\begin{aligned}
\omega_{mk}^T f(x) + b_{mk} &= \omega_{mk}^T \frac{1}{|A|} \sum_{x_j \in A} f(x_j) + b_{mk} + \omega_{mk}^T \Big( f(x) - \frac{1}{|A|} \sum_{x_j \in A} f(x_j) \Big) \\
&\geq \omega_{mk}^T \frac{1}{|A|} \sum_{x_j \in A} f(x_j) + b_{mk} - \Big\| f(x) - \frac{1}{|A|} \sum_{x_j \in A} f(x_j) \Big\| \\
&> \gamma_Q / 2 - \Big\| f(x) - \frac{1}{|A|} \sum_{x_j \in A} f(x_j) \Big\| \geq \gamma_Q / 2 - L \delta - \sqrt{d} \, \epsilon \geq 0 .
\end{aligned}
\]

Here, we obtain the second inequality from the hypothesis that the embedding is $Q$-mean
separable with margin larger than $\gamma_Q$, which implies that the embedding is also $R$-mean
separable with margin larger than $\gamma_Q$, for $R > Q$. Then the last inequality is due to the
condition (8) on the interpolation function in the theorem. We thus get that with probability
at least

\[ 1 - \exp\bigg( \frac{-2 ( N_m \eta_{m,\delta} - Q )^2}{N_m} \bigg) - 2d \exp\bigg( - \frac{Q \epsilon^2}{2 L^2 \delta^2} \bigg) \]

$\omega_{mk}^T f(x) + b_{mk} > 0$ for all $k \neq m$; hence, the sample $x$ is correctly classified. This concludes
the proof of the theorem. ■
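
The binomial tail step of this proof can be checked numerically by comparing the exact probability of observing at least $Q$ neighbors with the Hoeffding-type lower bound. This is only an illustrative sketch under the proof's hypothesis $N_m \eta_{m,\delta} > Q$; the function names and the example parameters are assumptions made for the illustration.

```python
import math


def prob_at_least_q(n: int, q: int, eta: float) -> float:
    """Exact P(|A| >= Q) when |A| ~ Binomial(n, eta)."""
    return sum(
        math.comb(n, k) * eta**k * (1.0 - eta) ** (n - k) for k in range(q, n + 1)
    )


def hoeffding_lower_bound(n: int, q: int, eta: float) -> float:
    """Lower bound 1 - exp(-2 (n*eta - Q)^2 / n), valid when n*eta > Q."""
    if n * eta <= q:
        raise ValueError("the bound requires N_m > Q / eta")
    return 1.0 - math.exp(-2.0 * (n * eta - q) ** 2 / n)


# Hypothetical settings (N_m, Q, eta) chosen only to compare the two quantities.
for n, q, eta in [(50, 5, 0.3), (200, 10, 0.1), (100, 20, 0.5)]:
    print(n, q, eta, prob_at_least_q(n, q, eta), hoeffding_lower_bound(n, q, eta))
```

The exact tail is always at least as large as the bound; the gap indicates how conservative the exponential term in the theorem is for a given sample size.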


A.4 Proof of Theorem 6

Proof. Remember from the proof of Theorem 5 that with probability at least

\[ 1 - \exp\bigg( \frac{-2 ( N_m \eta_{m,\delta} - Q )^2}{N_m} \bigg) - 2d \exp\bigg( - \frac{Q \epsilon^2}{2 L^2 \delta^2} \bigg) \]

the $\delta$-neighborhood $B_\delta(x)$ of a test sample $x$ from class $m$ contains at least $Q$ samples from
class $m$, and

\[ \Big\| f(x) - \frac{1}{|A|} \sum_{x_j \in A} f(x_j) \Big\| \leq L \delta + \sqrt{d} \, \epsilon \tag{39} \]

where $A$ is the set of training samples in $B_\delta(x)$ from class $m$.

Let $x_i, x_j \in A$ be two training samples from class $m$ in $B_\delta(x)$. As $\| x_i - x_j \| \leq 2\delta$, by
the hypothesis on the embedding, we have $\| y_i - y_j \| = \| f(x_i) - f(x_j) \| \leq D_{2\delta}$, which gives

\[
\Big\| f(x_i) - \frac{1}{|A|} \sum_{x_j \in A} f(x_j) \Big\|
= \Big\| \frac{1}{|A|} \sum_{x_j \in A} \big( f(x_i) - f(x_j) \big) \Big\|
\leq \frac{1}{|A|} \sum_{x_j \in A} \| f(x_i) - f(x_j) \| \leq D_{2\delta} .
\]

Then, for any $x_i \in A$,

\[
\begin{aligned}
\| f(x) - f(x_i) \| &= \Big\| f(x) - \frac{1}{|A|} \sum_{x_j \in A} f(x_j) + \frac{1}{|A|} \sum_{x_j \in A} f(x_j) - f(x_i) \Big\| \\
&\leq \Big\| f(x) - \frac{1}{|A|} \sum_{x_j \in A} f(x_j) \Big\| + D_{2\delta} .
\end{aligned}
\]

Combining this with (39), we get that with probability at least

\[ 1 - \exp\bigg( \frac{-2 ( N_m \eta_{m,\delta} - Q )^2}{N_m} \bigg) - 2d \exp\bigg( - \frac{Q \epsilon^2}{2 L^2 \delta^2} \bigg) \]

$B_\delta(x)$ will contain at least $Q$ samples $x_i$ from class $m$ such that

\[ \| f(x) - f(x_i) \| \leq L \delta + \sqrt{d} \, \epsilon + D_{2\delta} . \tag{40} \]

Now, assuming (40), let $x_i'$ be a training sample from another class (other than $m$). We
have

\[ \| f(x) - f(x_i') \| \geq \| f(x_i) - f(x_i') \| - \| f(x) - f(x_i) \| > \gamma - ( L \delta + \sqrt{d} \, \epsilon + D_{2\delta} ) \]

which follows from (40) and the hypothesis on the embedding that $\| f(x_i) - f(x_i') \| > \gamma$.

It follows from the condition (10) that $\gamma \geq 2 L \delta + 2 \sqrt{d} \, \epsilon + 2 D_{2\delta}$. Using this in the above
equation, we get

\[ \| f(x) - f(x_i') \| > L \delta + \sqrt{d} \, \epsilon + D_{2\delta} . \]

This means that the distance of $f(x)$ to the embedding of any other sample from another
class is more than $L \delta + \sqrt{d} \, \epsilon + D_{2\delta}$, while there are samples from its own class within a
distance of $L \delta + \sqrt{d} \, \epsilon + D_{2\delta}$ to $f(x)$. Therefore, $x$ is classified correctly with nearest-neighbor
classification in the low-dimensional domain of embedding. ■


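A.5 Proof of Lemma 7

Proof. The deviation of each component $f^k(x)$ of the interpolator from the sample average
in the neighborhood of $x$ is given by

\[
\Big| f^k(x) - \frac{1}{Q} \sum_{x_j \in A} f^k(x_j) \Big|
= \bigg| \sum_{i=1}^{N} c_i^k \Big( \phi( \| x - x_i \| ) - \frac{1}{Q} \sum_{x_j \in A} \phi( \| x_j - x_i \| ) \Big) \bigg| . \tag{41}
\]

We thus proceed by studying the term

\[ \Big| \phi( \| x - x_i \| ) - \frac{1}{Q} \sum_{x_j \in A} \phi( \| x_j - x_i \| ) \Big| \tag{42} \]

which will then be used in the above expression to arrive at the stated result.

Now let $x_i \in \mathcal{X}$ be any training sample. In order to study the term in (42), we first look at

\[ \Big| \phi( \| x - x_i \| ) - E_u\big[ \phi( \| u - x_i \| ) \mid u \in B_\delta(x) \big] \Big| \]

where $E_u[ \phi( \| u - x_i \| ) \mid u \in B_\delta(x) ]$ denotes the conditional expectation of $\phi( \| u - x_i \| )$ over
$u$, given $u \in B_\delta(x)$. The conditional expectation is given by

\[ E_u\big[ \phi( \| u - x_i \| ) \mid u \in B_\delta(x) \big] = \frac{1}{\nu_m(B_\delta(x))} \int_{B_\delta(x)} \phi( \| u - x_i \| ) \, d\nu_m(u) . \]

We have

\[
\begin{aligned}
\Big| \phi( \| x - x_i \| ) - E_u\big[ \phi( \| u - x_i \| ) \mid u \in B_\delta(x) \big] \Big|
&= \Big| \frac{1}{\nu_m(B_\delta(x))} \int_{B_\delta(x)} \big( \phi( \| x - x_i \| ) - \phi( \| u - x_i \| ) \big) \, d\nu_m(u) \Big| \\
&\leq \frac{1}{\nu_m(B_\delta(x))} \int_{B_\delta(x)} \big| \phi( \| x - x_i \| ) - \phi( \| u - x_i \| ) \big| \, d\nu_m(u) .
\end{aligned}
\]

The term in the integral is bounded as

\[ \big| \phi( \| x - x_i \| ) - \phi( \| u - x_i \| ) \big| \leq L_\phi \, \big| \, \| x - x_i \| - \| u - x_i \| \, \big| \leq L_\phi \| x - u \| . \]

Using this in the above term, we get

\[
\begin{aligned}
\Big| \phi( \| x - x_i \| ) - E_u\big[ \phi( \| u - x_i \| ) \mid u \in B_\delta(x) \big] \Big|
&\leq \frac{L_\phi}{\nu_m(B_\delta(x))} \int_{B_\delta(x)} \| x - u \| \, d\nu_m(u)
= L_\phi \, E_u\big[ \| u - x \| \mid u \in B_\delta(x) \big] \\
&\leq L_\phi \delta .
\end{aligned} \tag{43}
\]

We now analyze the term in (42) for a given $x_i$ for two different cases, i.e., for $x_i \notin B_\delta(x)$
and $x_i \in B_\delta(x)$. We first look at the case $x_i \notin B_\delta(x)$. For $x_j \in B_\delta(x)$, let

\[ \zeta_j := \phi( \| x_j - x_i \| ) . \]

The observations $\zeta_j$ are i.i.d. (since $x_j$ are i.i.d.) with mean $m_\zeta = E_u[ \phi( \| u - x_i \| ) \mid u \in B_\delta(x) ]$
and take values in the interval $\zeta_{\min} \leq \zeta_j \leq \zeta_{\max}$, where

\[ \zeta_{\min} := \inf_{u \in B_\delta(x)} \phi( \| u - x_i \| ), \qquad \zeta_{\max} := \sup_{u \in B_\delta(x)} \phi( \| u - x_i \| ) . \]

Since for any $u_1, u_2 \in B_\delta(x)$, $\| u_1 - u_2 \| \leq 2\delta$, it follows from the Lipschitz continuity of $\phi$
that $\zeta_{\max} - \zeta_{\min} \leq 2 L_\phi \delta$. Using this together with Hoeffding's inequality, we get

\[
P\bigg( \Big| \frac{1}{Q} \sum_{x_j \in A} \zeta_j - m_\zeta \Big| \geq \epsilon \bigg)
\leq 2 \exp\bigg( - \frac{2 Q \epsilon^2}{( \zeta_{\max} - \zeta_{\min} )^2} \bigg)
\leq 2 \exp\bigg( - \frac{Q \epsilon^2}{2 L_\phi^2 \delta^2} \bigg) . \tag{44}
\]

We have

\[
\Big| \phi( \| x - x_i \| ) - \frac{1}{Q} \sum_{x_j \in A} \phi( \| x_j - x_i \| ) \Big|
\leq \big| \phi( \| x - x_i \| ) - m_\zeta \big| + \Big| m_\zeta - \frac{1}{Q} \sum_{x_j \in A} \phi( \| x_j - x_i \| ) \Big| .
\]

Using (43) and (44) in the above equation, it holds with probability at least

\[ 1 - 2 \exp\bigg( - \frac{Q \epsilon^2}{2 L_\phi^2 \delta^2} \bigg) \]

that

\[ \Big| \phi( \| x - x_i \| ) - \frac{1}{Q} \sum_{x_j \in A} \phi( \| x_j - x_i \| ) \Big| \leq L_\phi \delta + \epsilon . \]

Next, we study the case $x_i \in B_\delta(x)$. For any fixed $x_i \in B_\delta(x)$, hence $x_i \in A$, we have

\[
\begin{aligned}
\Big| \phi( \| x - x_i \| ) - \frac{1}{Q} \sum_{x_j \in A} \phi( \| x_j - x_i \| ) \Big|
&= \Big| \frac{1}{Q} \phi( \| x - x_i \| ) + \frac{Q-1}{Q} \phi( \| x - x_i \| ) - \frac{1}{Q} \phi( \| x_i - x_i \| ) - \frac{1}{Q} \sum_{x_j \in A \setminus \{ x_i \}} \phi( \| x_j - x_i \| ) \Big| \\
&\leq \frac{1}{Q} \big| \phi( \| x - x_i \| ) - \phi( \| x_i - x_i \| ) \big|
+ \frac{Q-1}{Q} \Big| \phi( \| x - x_i \| ) - \frac{1}{Q-1} \sum_{x_j \in A \setminus \{ x_i \}} \phi( \| x_j - x_i \| ) \Big| .
\end{aligned}
\]

The first term above is bounded as

\[ \frac{1}{Q} \big| \phi( \| x - x_i \| ) - \phi( \| x_i - x_i \| ) \big| \leq \frac{L_\phi \delta}{Q} . \]

Next, similarly to the analysis of the case $x_i \notin B_\delta(x)$, we get that for $x_i \in B_\delta(x)$ with
probability at least

\[ 1 - 2 \exp\bigg( - \frac{(Q-1) \epsilon^2}{2 L_\phi^2 \delta^2} \bigg) \]

it holds that

\[ \Big| \phi( \| x - x_i \| ) - \frac{1}{Q-1} \sum_{x_j \in A \setminus \{ x_i \}} \phi( \| x_j - x_i \| ) \Big| \leq L_\phi \delta + \epsilon, \]

hence

\[
\Big| \phi( \| x - x_i \| ) - \frac{1}{Q} \sum_{x_j \in A} \phi( \| x_j - x_i \| ) \Big|
\leq \frac{L_\phi \delta}{Q} + \frac{Q-1}{Q} ( L_\phi \delta + \epsilon ) \leq L_\phi \delta + \epsilon .
\]

Combining the analyses of the cases $x_i \notin B_\delta(x)$ and $x_i \in B_\delta(x)$, we conclude that for any
given $x_i \in \mathcal{X}$,

\[
P\bigg( \Big| \phi( \| x - x_i \| ) - \frac{1}{Q} \sum_{x_j \in A} \phi( \| x_j - x_i \| ) \Big| \leq L_\phi \delta + \epsilon \bigg)
\geq 1 - 2 \exp\bigg( - \frac{(Q-1) \epsilon^2}{2 L_\phi^2 \delta^2} \bigg) .
\]

Therefore, applying the union bound on all $N$ samples $x_i$ in $\mathcal{X}$, we get that with probability
at least

\[ 1 - 2N \exp\bigg( - \frac{(Q-1) \epsilon^2}{2 L_\phi^2 \delta^2} \bigg) \]

it holds that

\[ \Big| \phi( \| x - x_i \| ) - \frac{1}{Q} \sum_{x_j \in A} \phi( \| x_j - x_i \| ) \Big| \leq L_\phi \delta + \epsilon \tag{45} \]

for all $x_i \in \mathcal{X}$.

We can now use this in (41) to bound the deviation of $f^k(x)$ from the empirical mean
of $f^k$ in the neighborhood of $x$. Assuming that the condition (45) holds for all $x_i \in \mathcal{X}$, we
obtain

\[
\Big| f^k(x) - \frac{1}{Q} \sum_{x_j \in A} f^k(x_j) \Big|
\leq \sum_{i=1}^{N} | c_i^k | \, \Big| \phi( \| x - x_i \| ) - \frac{1}{Q} \sum_{x_j \in A} \phi( \| x_j - x_i \| ) \Big|
\leq ( L_\phi \delta + \epsilon ) \sum_{i=1}^{N} | c_i^k | \leq C ( L_\phi \delta + \epsilon ),
\]

which gives

\[
\Big\| f(x) - \frac{1}{Q} \sum_{x_j \in A} f(x_j) \Big\|
= \bigg( \sum_{k=1}^{d} \Big( f^k(x) - \frac{1}{Q} \sum_{x_j \in A} f^k(x_j) \Big)^2 \bigg)^{1/2}
\leq \sqrt{d} \, C ( L_\phi \delta + \epsilon ) .
\]

We thus get

\[
P\bigg( \Big\| f(x) - \frac{1}{Q} \sum_{x_j \in A} f(x_j) \Big\| \leq \sqrt{d} \, C ( L_\phi \delta + \epsilon ) \bigg)
\geq 1 - 2N \exp\bigg( - \frac{(Q-1) \epsilon^2}{2 L_\phi^2 \delta^2} \bigg)
\]

which completes the proof.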
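
The two constants that drive this bound, the coefficient magnitude $C$ and the Lipschitz constant $L_\phi$ of the kernel, can be computed for a concrete RBF interpolator. The sketch below is an illustration under explicit assumptions: it uses a Gaussian kernel $\phi(r) = e^{-r^2/\sigma^2}$, fits the coefficients by solving the plain interpolation system, and takes $C$ as the largest value of $\sum_i |c_i^k|$ over the dimensions $k$. The function names and the toy data are hypothetical.

```python
import numpy as np


def gaussian_rbf(r, sigma):
    """Assumed kernel phi(r) = exp(-r^2 / sigma^2)."""
    return np.exp(-((r / sigma) ** 2))


def fit_rbf_interpolator(X, Y, sigma):
    """Solve Phi c = Y so that f(x_i) = y_i on the training samples."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.linalg.solve(gaussian_rbf(dists, sigma), Y)  # shape (N, d)


def evaluate(X_new, X, coeffs, sigma):
    """f^k(x) = sum_i c_i^k phi(||x - x_i||) for every row x of X_new."""
    dists = np.linalg.norm(X_new[:, None, :] - X[None, :, :], axis=-1)
    return gaussian_rbf(dists, sigma) @ coeffs


def lemma7_constants(coeffs, sigma):
    """Return (C, L_phi): C = max_k sum_i |c_i^k|, L_phi = sup_r |phi'(r)|."""
    C = np.abs(coeffs).sum(axis=0).max()
    # |phi'(r)| = (2 r / sigma^2) exp(-r^2 / sigma^2) peaks at r = sigma / sqrt(2).
    L_phi = np.sqrt(2.0) * np.exp(-0.5) / sigma
    return C, L_phi


# Toy training set (hypothetical): four points in the plane, 2-D embedding.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.5], [-1.0, 0.5]])
coeffs = fit_rbf_interpolator(X, Y, sigma=1.0)
C, L_phi = lemma7_constants(coeffs, sigma=1.0)
```

A larger kernel scale $\sigma$ lowers $L_\phi$ but typically inflates the coefficients, and hence $C$, because the kernel matrix becomes worse conditioned; the product $C L_\phi$ is what enters the bound.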


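A.6 Proof of Theorem 8

Proof. Remember from the proof of Theorem 5 that

\[ P( |A| \geq Q ) \geq 1 - \exp\bigg( \frac{-2 ( N_m \eta_{m,\delta} - Q )^2}{N_m} \bigg) . \]

Lemma 7 states that, if $B_\delta(x)$ contains at least $Q$ samples from class $m$, i.e., $|A| \geq Q$,
then

\[
P\bigg( \Big\| f(x) - \frac{1}{|A|} \sum_{x_j \in A} f(x_j) \Big\| \leq \sqrt{d} \, C ( L_\phi \delta + \epsilon ) \bigg)
\geq 1 - 2N \exp\bigg( - \frac{(|A|-1) \epsilon^2}{2 L_\phi^2 \delta^2} \bigg)
\geq 1 - 2N \exp\bigg( - \frac{(Q-1) \epsilon^2}{2 L_\phi^2 \delta^2} \bigg) .
\]

Hence, combining these two results (multiplying both probabilities), we get that with
probability at least

\[
\bigg( 1 - \exp\bigg( \frac{-2 ( N_m \eta_{m,\delta} - Q )^2}{N_m} \bigg) \bigg) \bigg( 1 - 2N \exp\bigg( - \frac{(Q-1) \epsilon^2}{2 L_\phi^2 \delta^2} \bigg) \bigg)
\geq 1 - \exp\bigg( \frac{-2 ( N_m \eta_{m,\delta} - Q )^2}{N_m} \bigg) - 2N \exp\bigg( - \frac{(Q-1) \epsilon^2}{2 L_\phi^2 \delta^2} \bigg)
\]

it holds that

\[ \Big\| f(x) - \frac{1}{|A|} \sum_{x_j \in A} f(x_j) \Big\| \leq \sqrt{d} \, C ( L_\phi \delta + \epsilon ) . \tag{46} \]

A test sample $x$ drawn from $\nu_m$ is classified correctly with the linear classifier if

\[ \omega_{mk}^T f(x) + b_{mk} > 0, \quad \forall k = 1, \dots, M-1, \ k \neq m . \]

If the condition in (46) is satisfied, for all $k \neq m$, we have

\[
\begin{aligned}
\omega_{mk}^T f(x) + b_{mk} &= \omega_{mk}^T \frac{1}{|A|} \sum_{x_j \in A} f(x_j) + b_{mk} + \omega_{mk}^T \Big( f(x) - \frac{1}{|A|} \sum_{x_j \in A} f(x_j) \Big) \\
&\geq \omega_{mk}^T \frac{1}{|A|} \sum_{x_j \in A} f(x_j) + b_{mk} - \Big\| f(x) - \frac{1}{|A|} \sum_{x_j \in A} f(x_j) \Big\| \\
&> \gamma_Q / 2 - \Big\| f(x) - \frac{1}{|A|} \sum_{x_j \in A} f(x_j) \Big\| \geq \gamma_Q / 2 - \sqrt{d} \, C ( L_\phi \delta + \epsilon ) \geq 0 .
\end{aligned}
\]

We thus conclude that with probability at least

\[ 1 - \exp\bigg( \frac{-2 ( N_m \eta_{m,\delta} - Q )^2}{N_m} \bigg) - 2N \exp\bigg( - \frac{(Q-1) \epsilon^2}{2 L_\phi^2 \delta^2} \bigg) \]

$\omega_{mk}^T f(x) + b_{mk} > 0$ for all $k \neq m$; hence, the class label of $x$ is estimated correctly. ■


A.7 Proof of Theorem 9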

Proof. First, recall from the proof of Theorem 8 that, with probability at least

\[ 1 - \exp\bigg( \frac{-2 ( N_m \eta_{m,\delta} - Q )^2}{N_m} \bigg) - 2N \exp\bigg( - \frac{(Q-1) \epsilon^2}{2 L_\phi^2 \delta^2} \bigg) \]

the $\delta$-neighborhood $B_\delta(x)$ of a test sample $x$ from class $m$ contains at least $Q$ samples from
class $m$, and

\[ \Big\| f(x) - \frac{1}{|A|} \sum_{x_j \in A} f(x_j) \Big\| \leq \sqrt{d} \, C ( L_\phi \delta + \epsilon ) \tag{47} \]

where $A$ is the set of training samples in $B_\delta(x)$ from class $m$.

Then it is easy to show (as in the proof of Theorem 6) that, with probability at least

\[ 1 - \exp\bigg( \frac{-2 ( N_m \eta_{m,\delta} - Q )^2}{N_m} \bigg) - 2N \exp\bigg( - \frac{(Q-1) \epsilon^2}{2 L_\phi^2 \delta^2} \bigg) \]

$B_\delta(x)$ will contain at least $Q$ samples $x_i$ from class $m$ such that

\[ \| f(x) - f(x_i) \| \leq \sqrt{d} \, C ( L_\phi \delta + \epsilon ) + D_{2\delta} . \tag{48} \]

Hence, for a training sample $x_i'$ from another class (other than $m$), we have

\[ \| f(x) - f(x_i') \| \geq \| f(x_i) - f(x_i') \| - \| f(x) - f(x_i) \| > \gamma - \big( \sqrt{d} \, C ( L_\phi \delta + \epsilon ) + D_{2\delta} \big) \]

which follows from (48) and the hypothesis on the embedding that $\| f(x_i) - f(x_i') \| > \gamma$.

Due to the condition (16), we have $\gamma \geq 2 \sqrt{d} \, C ( L_\phi \delta + \epsilon ) + 2 D_{2\delta}$. Using this in the above
equation, we obtain

\[ \| f(x) - f(x_i') \| > \sqrt{d} \, C ( L_\phi \delta + \epsilon ) + D_{2\delta} . \]

Therefore, the distance of $f(x)$ to the embedding of the samples from other classes is more
than $\sqrt{d} \, C ( L_\phi \delta + \epsilon ) + D_{2\delta}$, while there are samples from its own class within a distance
of $\sqrt{d} \, C ( L_\phi \delta + \epsilon ) + D_{2\delta}$ to $f(x)$. We thus conclude that the class label of $x$ is estimated
correctly with nearest-neighbor classification in the low-dimensional domain of embedding. ■


Appendix B. Proof of the results in Section 3

B.1 Proof of Lemma 13

Proof. The coordinate vector $y$ is the eigenvector of the matrix $L_w - \mu L_b$ corresponding
to its minimum eigenvalue. Hence,

\[ y = \arg\min_{\| \xi \| = 1} \xi^T ( L_w - \mu L_b ) \xi . \]

Equivalently, defining the degree-normalized coordinates $z = D_w^{-1/2} y$, and thus replacing
the above $\xi$ by $D_w^{1/2} \xi$, we have

\[ z = \arg\min_{\xi^T D_w \xi = 1} N(\xi) \tag{49} \]

where

\[
\begin{aligned}
N(\xi) &= \xi^T D_w^{1/2} ( L_w - \mu L_b ) D_w^{1/2} \xi \\
&= \xi^T ( D_w - W_w ) \xi - \mu \, \xi^T ( D_w D_b^{-1} )^{1/2} ( D_b - W_b ) ( D_b^{-1} D_w )^{1/2} \xi .
\end{aligned}
\]

Then, denoting $\beta_i = d_w(i) / d_b(i)$, the term $N(\xi)$ can be rearranged as

\[
\begin{aligned}
N(\xi) &= \sum_i \xi_i \Big( d_w(i) \, \xi_i - \sum_{j \sim_w i} \xi_j w_{ij} \Big)
- \mu \sum_i \xi_i \Big( d_w(i) \, \xi_i - \sum_{j \sim_b i} \sqrt{\beta_i \beta_j} \, \xi_j w_{ij} \Big) \\
&= \sum_i \sum_{j \sim_w i} ( \xi_i^2 - \xi_i \xi_j ) w_{ij}
- \mu \sum_i \sum_{j \sim_b i} \big( \beta_i \xi_i^2 - \sqrt{\beta_i \beta_j} \, \xi_i \xi_j \big) w_{ij}
\end{aligned}
\]

which gives (see footnote 9)

\[
N(\xi) = \sum_{i \sim_w j} ( \xi_i - \xi_j )^2 w_{ij} - \mu \sum_{i \sim_b j} \big( \sqrt{\beta_i} \, \xi_i - \sqrt{\beta_j} \, \xi_j \big)^2 w_{ij} \tag{50}
\]

by grouping the neighboring $(i, j)$ pairs in the inner sums. Now, for any $\xi \in \mathbb{R}^{N \times 1}$ such
that $\xi^T D_w \xi = 1$, we define $\xi^*$ as follows:

\[
\xi_i^* = \begin{cases} - | \xi_i | & \text{if } C_i = 1 \\ \ \ | \xi_i | & \text{if } C_i = 2 . \end{cases} \tag{51}
\]

Clearly, $\xi^*$ also satisfies $(\xi^*)^T D_w \xi^* = 1$. From (50), it can be easily checked that $N(\xi^*) \leq N(\xi)$
for any $\xi$. Then, a minimizer $z$ of the problem (49) has to be of the separable form
defined in (51); otherwise $z^*$ would yield a smaller value for the function $N$, which would
contradict the fact that $z$ is a minimizer. Note that the equality $N(z^*) = N(z)$ holds only
if $z = z^*$ or $z = -z^*$, thus when $z$ is separable. Therefore, the embedding $z$ satisfies the
condition

\[ z_i \leq 0 \ \text{ if } C_i = 1, \qquad z_i \geq 0 \ \text{ if } C_i = 2 \]

or the equivalent condition

\[ z_i \leq 0 \ \text{ if } C_i = 2, \qquad z_i \geq 0 \ \text{ if } C_i = 1 . \]

Finally, since $y_i = \sqrt{d_w(i)} \, z_i$, the same property also holds for the embedding $y$. ■


9. In our notation, the terms $i \sim_w j$ and $i \sim_b j$ in the summation indices as in (50) refer to edges rather
than neighboring $(i, j)$-pairs; i.e., each pair is counted only once in the summation.

B.2 Proof of Theorem 14

Proof. From (49) and (50), we have

\[
z = \arg\min_{\xi^T D_w \xi = 1} \ \sum_{i \sim_w j} ( \xi_i - \xi_j )^2 w_{ij} - \mu \sum_{i \sim_b j} \big( \sqrt{\beta_i} \, \xi_i - \sqrt{\beta_j} \, \xi_j \big)^2 w_{ij} . \tag{52}
\]

Thus, at the optimal solution $z$ the objective function takes the value

\[
N(z) = \sum_{i \sim_w j} ( z_i - z_j )^2 w_{ij} - \mu \sum_{i \sim_b j} \big( \sqrt{\beta_i} \, z_i - \sqrt{\beta_j} \, z_j \big)^2 w_{ij} . \tag{53}
\]

In the following, we derive a lower bound for the first sum and an upper bound for the
second sum in (53). We begin with the first sum. Let $i_{1,\min}$, $i_{1,\max}$, $i_{2,\min}$ and $i_{2,\max}$ denote
the indices of the data samples in class 1 and class 2 that are respectively mapped to the
extremal coordinates $z_{1,\min}$, $z_{1,\max}$, $z_{2,\min}$, $z_{2,\max}$, where

\[ z_{k,\min} = \min_{i \,:\, C_i = k} z_i, \qquad z_{k,\max} = \max_{i \,:\, C_i = k} z_i . \]

Let $P_1 = \{ ( x_{k_{i-1}}, x_{k_i} ) \}_{i=1}^{L_1}$ be a shortest path of length $L_1$ joining $x_{i_{1,\min}}$ and $x_{i_{1,\max}}$, and
$P_2 = \{ ( x_{n_{i-1}}, x_{n_i} ) \}_{i=1}^{L_2}$ be a shortest path of length $L_2$ joining $x_{i_{2,\min}}$ and $x_{i_{2,\max}}$. We have

\[
\begin{aligned}
\sum_{i \sim_w j} ( z_i - z_j )^2 w_{ij}
&\geq \sum_{i=1}^{L_1} ( z_{k_i} - z_{k_{i-1}} )^2 w_{k_{i-1} k_i} + \sum_{i=1}^{L_2} ( z_{n_i} - z_{n_{i-1}} )^2 w_{n_{i-1} n_i} \\
&\geq w_{\min,1} \sum_{i=1}^{L_1} ( z_{k_i} - z_{k_{i-1}} )^2 + w_{\min,2} \sum_{i=1}^{L_2} ( z_{n_i} - z_{n_{i-1}} )^2
\end{aligned} \tag{54}
\]

where the first inequality simply follows from the fact that the set of edges making up
$P_1 \cup P_2$ is contained in the set of all edges in $G_w$. For a sequence $\{ a_i \}_{i=0}^{L}$, the following
inequality holds:

\[
\begin{aligned}
( a_L - a_0 )^2 &= \sum_{i=1}^{L} ( a_i - a_{i-1} )^2 + \sum_{\substack{i,j=1 \\ i \neq j}}^{L} ( a_i - a_{i-1} )( a_j - a_{j-1} ) \\
&\leq \sum_{i=1}^{L} ( a_i - a_{i-1} )^2 + \frac{1}{2} \sum_{\substack{i,j=1 \\ i \neq j}}^{L} \big( ( a_i - a_{i-1} )^2 + ( a_j - a_{j-1} )^2 \big)
= L \sum_{i=1}^{L} ( a_i - a_{i-1} )^2 .
\end{aligned}
\]

Hence,

\[ \sum_{i=1}^{L} ( a_i - a_{i-1} )^2 \geq \frac{1}{L} ( a_L - a_0 )^2 . \]

Using this inequality in (54), we get

\[
\sum_{i \sim_w j} ( z_i - z_j )^2 w_{ij} \geq \frac{w_{\min,1}}{L_1} ( z_{1,\max} - z_{1,\min} )^2 + \frac{w_{\min,2}}{L_2} ( z_{2,\max} - z_{2,\min} )^2 .
\]

Since the path lengths $L_1$ and $L_2$ are upper bounded by the diameters $D_1$ and $D_2$, we finally
obtain the lower bound

\[
\sum_{i \sim_w j} ( z_i - z_j )^2 w_{ij} \geq \frac{w_{\min,1}}{D_1} ( z_{1,\max} - z_{1,\min} )^2 + \frac{w_{\min,2}}{D_2} ( z_{2,\max} - z_{2,\min} )^2 . \tag{55}
\]

Next, we find an upper bound for the second sum in (53). Using Lemma 13, we obtain the
following inequality:

\[
\sum_{i \sim_b j} \big( \sqrt{\beta_i} \, z_i - \sqrt{\beta_j} \, z_j \big)^2 w_{ij}
\leq \sum_{i \sim_b j} ( z_{2,\max} - z_{1,\min} )^2 \beta_{\max} \, w_{ij}
\leq \frac{1}{2} ( z_{2,\max} - z_{1,\min} )^2 \beta_{\max} V^b_{\max} . \tag{56}
\]

Now, since the solution $z$ in (52) minimizes the objective function $N(\xi)$, we have

\[ N(z) = \lambda_{\min}( L_w - \mu L_b ) \]

where $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ respectively denote the minimum and the maximum eigenvalues
of a matrix. For two Hermitian matrices $A$ and $B$, the inequality $\lambda_{\min}(A + B) \leq \lambda_{\min}(A) + \lambda_{\max}(B)$
holds. As $L_w$ and $L_b$ are graph Laplacian matrices, we have $\lambda_{\min}(L_w) = \lambda_{\min}(L_b) = 0$ and thus

\[ N(z) = \lambda_{\min}( L_w - \mu L_b ) \leq \lambda_{\min}( L_w ) + \lambda_{\max}( - \mu L_b ) = \lambda_{\min}( L_w ) - \mu \lambda_{\min}( L_b ) = 0 . \]

Using in (53) the above inequality and the lower and upper bounds in (55) and (56), we
obtain

\[
\begin{aligned}
0 \geq N(z) &= \sum_{i \sim_w j} ( z_i - z_j )^2 w_{ij} - \mu \sum_{i \sim_b j} \big( \sqrt{\beta_i} \, z_i - \sqrt{\beta_j} \, z_j \big)^2 w_{ij} \\
&\geq \frac{w_{\min,1}}{D_1} ( z_{1,\max} - z_{1,\min} )^2 + \frac{w_{\min,2}}{D_2} ( z_{2,\max} - z_{2,\min} )^2
- \frac{1}{2} \mu ( z_{2,\max} - z_{1,\min} )^2 \beta_{\max} V^b_{\max} .
\end{aligned}
\]

Hence

\[
\frac{w_{\min,1}}{D_1} ( z_{1,\max} - z_{1,\min} )^2 + \frac{w_{\min,2}}{D_2} ( z_{2,\max} - z_{2,\min} )^2
\leq \frac{1}{2} \mu ( z_{2,\max} - z_{1,\min} )^2 \beta_{\max} V^b_{\max} . \tag{57}
\]

The RHS of the above inequality is related to the overall support $z_{2,\max} - z_{1,\min}$ of the
data, whereas the terms on the LHS are related to the individual supports $z_{1,\max} - z_{1,\min}$
and $z_{2,\max} - z_{2,\min}$ of the two classes in the learnt embedding. Meanwhile, the separation
$z_{2,\min} - z_{1,\max}$ between the two classes is given by the gap between the overall support and
the sum of the individual supports. In order to use the above inequality in view of this
observation, we first derive a lower bound on the RHS term. Since $z^T D_w z = 1$, we have

\[
\begin{aligned}
1 = \sum_i z_i^2 \, d_w(i) &= \sum_{i \,:\, C_i = 1} z_i^2 \, d_w(i) + \sum_{i \,:\, C_i = 2} z_i^2 \, d_w(i) \\
&\leq z_{1,\min}^2 \sum_{i \,:\, C_i = 1} d_w(i) + z_{2,\max}^2 \sum_{i \,:\, C_i = 2} d_w(i) = z_{1,\min}^2 V_1 + z_{2,\max}^2 V_2 .
\end{aligned}
\]

This gives

\[ z_{1,\min}^2 + z_{2,\max}^2 \geq \frac{1}{V_{\max}} . \]

Hence, we obtain the following lower bound on the overall support:

\[ ( z_{2,\max} - z_{1,\min} )^2 \geq z_{2,\max}^2 + z_{1,\min}^2 \geq \frac{1}{V_{\max}} . \tag{58} \]

Denoting the supports of class 1 and class 2 and the overall support as

\[ S_1 = z_{1,\max} - z_{1,\min}, \qquad S_2 = z_{2,\max} - z_{2,\min}, \qquad S = z_{2,\max} - z_{1,\min}, \]

we have from (57)

\[ w_{\min} ( S_1^2 + S_2^2 ) \leq \frac{1}{2} \mu S^2 \beta_{\max} V^b_{\max} \]

which yields the following upper bound on the total support of the two classes:

\[ S_1 + S_2 \leq \sqrt{ 2 ( S_1^2 + S_2^2 ) } \leq S \sqrt{ \frac{ \mu \beta_{\max} V^b_{\max} }{ w_{\min} } } . \]

We can thus lower bound the separation $z_{2,\min} - z_{1,\max}$ as

\[ z_{2,\min} - z_{1,\max} = S - ( S_1 + S_2 ) \geq S \bigg( 1 - \sqrt{ \frac{ \mu \beta_{\max} V^b_{\max} }{ w_{\min} } } \bigg) \]

provided that $\mu < w_{\min} / ( \beta_{\max} V^b_{\max} )$. From the lower bound on the overall support in (58),
we lower bound the separation as follows:

\[ z_{2,\min} - z_{1,\max} \geq \frac{1}{\sqrt{V_{\max}}} \bigg( 1 - \sqrt{ \frac{ \mu \beta_{\max} V^b_{\max} }{ w_{\min} } } \bigg) . \]

Finally, since the separation of any embedding with dimension $d \geq 1$ is at least as much as
the separation $z_{2,\min} - z_{1,\max}$ of the embedding of dimension $d = 1$, the above lower bound
holds for any $d \geq 1$ as well. ■
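
The sign structure from Lemma 13 and the separation $z_{2,\min} - z_{1,\max}$ analyzed above can be observed on a small example. The sketch below is an illustration under stated assumptions, not the experimental setup of the paper: it uses Gaussian edge weights on fully connected within-class and between-class graphs, the normalized Laplacians $D^{-1/2}(D - W)D^{-1/2}$, and hypothetical values for $\mu$ and the kernel scale.

```python
import numpy as np


def normalized_laplacian(W):
    """Return D^{-1/2} (D - W) D^{-1/2} and the degree vector of W."""
    d = W.sum(axis=1)
    s = 1.0 / np.sqrt(d)
    return np.eye(len(d)) - s[:, None] * W * s[None, :], d


def supervised_embedding_1d(X, labels, mu, sigma):
    """1-D embedding z = D_w^{-1/2} y, y = eigenvector of L_w - mu L_b (min eigenvalue)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dists / sigma**2)          # assumed Gaussian edge weights
    same = labels[:, None] == labels[None, :]
    W_w = np.where(same, K, 0.0)              # within-class graph
    np.fill_diagonal(W_w, 0.0)
    W_b = np.where(same, 0.0, K)              # between-class graph
    L_w, d_w = normalized_laplacian(W_w)
    L_b, _ = normalized_laplacian(W_b)
    evals, evecs = np.linalg.eigh(L_w - mu * L_b)  # eigenvalues in ascending order
    y = evecs[:, 0]
    return y / np.sqrt(d_w), evals[0]


def separation(z, labels):
    """Gap between the two classes; the eigenvector sign is arbitrary, so try both."""
    a, b = z[labels == 1], z[labels == 2]
    return max(b.min() - a.max(), a.min() - b.max())


# Toy two-class data on a line (hypothetical values).
X = np.array([[0.0], [1.0], [2.0], [5.0], [6.0], [7.0]])
labels = np.array([1, 1, 1, 2, 2, 2])
z, lam_min = supervised_embedding_1d(X, labels, mu=0.1, sigma=3.0)
gap = separation(z, labels)
```

In this setting the smallest eigenvalue is nonpositive, as used in the proof, and the two classes land on opposite sides of the origin, so `gap` is positive. Comparing `gap` against the theoretical lower bound would additionally require the graph quantities $w_{\min}$, $\beta_{\max}$, $V_{\max}$ and $V^b_{\max}$ from the theorem statement.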


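B.3 Proof of Corollary 15

Proof. The one-dimensional embedding $z$ is given as the solution of the constrained
optimization problem

\[ z = \arg\min_{\xi} N(\xi) \quad \text{s.t. } D(\xi) = 1 \]

where

\[ N(\xi) = \xi^T D_w^{1/2} ( L_w - \mu L_b ) D_w^{1/2} \xi, \qquad D(\xi) = \xi^T D_w \xi . \]

Defining the Lagrangian function

\[ \Lambda( \xi, \lambda ) = N(\xi) + \lambda ( D(\xi) - 1 ), \]

at the optimal solution $z$, we have

\[ \nabla_\xi \Lambda = \nabla_\lambda \Lambda = 0 \]

where $\nabla_\xi$ and $\nabla_\lambda$ respectively denote the derivatives of $\Lambda$ with respect to $\xi$ and $\lambda$. Thus,
at $\xi = z$,

\[ \frac{\partial \Lambda}{\partial \xi_i} = \frac{\partial N(\xi)}{\partial \xi_i} + \lambda \frac{\partial D(\xi)}{\partial \xi_i} = 0 \]

for all $i = 1, \dots, N$. From (50), the derivatives of $N(\xi)$ and $D(\xi)$ at $z$ are given by

\[
\begin{aligned}
\frac{\partial N(\xi)}{\partial \xi_i} \bigg|_{\xi = z} &= \sum_{j \sim_w i} 2 ( z_i - z_j ) w_{ij} - \mu \sum_{j \sim_b i} 2 \big( \sqrt{\beta_i} \, z_i - \sqrt{\beta_j} \, z_j \big) \sqrt{\beta_i} \, w_{ij} \\
\frac{\partial D(\xi)}{\partial \xi_i} \bigg|_{\xi = z} &= 2 \, d_w(i) \, z_i
\end{aligned}
\]

which yields

\[
\sum_{j \sim_w i} ( z_i - z_j ) w_{ij} - \mu \sum_{j \sim_b i} \big( \sqrt{\beta_i} \, z_i - \sqrt{\beta_j} \, z_j \big) \sqrt{\beta_i} \, w_{ij} + \lambda \, d_w(i) \, z_i = 0 \tag{59}
\]

for all $i$. At $i = i_{1,\max}$, as $z$ attains its maximal value $z_{1,\max}$ for class 1, we have

\[
\begin{aligned}
\lambda \, d_w( i_{1,\max} ) \, z_{1,\max} &= \sum_{j \sim_w i_{1,\max}} ( z_j - z_{1,\max} ) \, w_{i_{1,\max} j}
+ \mu \sum_{j \sim_b i_{1,\max}} \big( \sqrt{\beta_{i_{1,\max}}} \, z_{1,\max} - \sqrt{\beta_j} \, z_j \big) \sqrt{\beta_{i_{1,\max}}} \, w_{i_{1,\max} j} \\
&\leq - \mu \, \beta_{\min} ( z_{2,\min} - z_{1,\max} ) \, d_b( i_{1,\max} ) .
\end{aligned}
\]

Hence

\[
| z_{1,\max} | = - z_{1,\max} \geq \frac{ \mu \, \beta_{\min} ( z_{2,\min} - z_{1,\max} ) \, d_b( i_{1,\max} ) }{ \lambda \, d_w( i_{1,\max} ) }
\geq \frac{ \mu \, \beta_{\min} ( z_{2,\min} - z_{1,\max} ) }{ \lambda \, \beta_{\max} } . \tag{60}
\]

We proceed by deriving an upper bound for $\lambda$. The gradients of $N(\xi)$ and $D(\xi)$ are given
by

\[ \nabla_\xi N = 2 D_w^{1/2} ( L_w - \mu L_b ) D_w^{1/2} \xi, \qquad \nabla_\xi D = 2 D_w \xi . \]

From the condition $\nabla_\xi \Lambda = 0$ at $\xi = z$, we have

\[
\begin{aligned}
D_w^{1/2} ( L_w - \mu L_b ) D_w^{1/2} z + \lambda D_w z &= 0 \\
( L_w - \mu L_b ) y + \lambda y &= 0 .
\end{aligned}
\]

Since $y = D_w^{1/2} z$ is the unit-norm eigenvector of $L_w - \mu L_b$ corresponding to its smallest
eigenvalue, the Lagrangian multiplier $\lambda$ is given by

\[ \lambda = - \lambda_{\min}( L_w - \mu L_b ) . \]

We can lower bound the minimum eigenvalue as

\[ \lambda_{\min}( L_w - \mu L_b ) \geq \lambda_{\min}( L_w ) + \lambda_{\min}( - \mu L_b ) = 0 - \mu \lambda_{\max}( L_b ) \geq - 2\mu \]

since the eigenvalues of a graph Laplacian are upper bounded by 2. This gives $\lambda \leq 2\mu$.
Using this upper bound on $\lambda$ in (60), we obtain

\[ | z_{1,\max} | \geq \frac{1}{2} \frac{\beta_{\min}}{\beta_{\max}} ( z_{2,\min} - z_{1,\max} ) . \]

Repeating the same steps for $i = i_{2,\min}$ following (59), one can similarly show that

\[ | z_{2,\min} | \geq \frac{1}{2} \frac{\beta_{\min}}{\beta_{\max}} ( z_{2,\min} - z_{1,\max} ) . \]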


B.4 Proof of Theorem 16

We first present two lemmas that will be useful for proving Theorem 16.

Lemma 17 Let $A \in \mathbb{R}^{N \times N}$ be a symmetric matrix with eigenvalue decomposition $A = U \Lambda U^T$,
where $U$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix consisting of the
eigenvalues $\lambda_1, \dots, \lambda_N$. Consider a symmetric perturbation $\Delta A$ on $A$. Let the perturbed
matrix $\tilde{A} = A + \Delta A$ have the eigenvalue decomposition $\tilde{A} = \tilde{U} \tilde{\Lambda} \tilde{U}^T$.

Assume that the eigenvalues $\lambda_i$ have a separation of at least $\eta$, i.e., for all distinct $i, j$,
one has $| \lambda_i - \lambda_j | \geq \eta$. Then the inner products of the corresponding eigenvectors of $A$ and
$\tilde{A}$ are lower bounded as

\[ | \tilde{u}_j^T u_j | \geq \sqrt{ 1 - \frac{ 4 \| \Delta A \|^2 }{ \eta^2 } } \]

for all $j = 1, \dots, N$, where $u_j$ denotes the $j$-th column of $U$.

Proof. Defining $R = \tilde{U}^T U$, we look for a lower bound on the diagonal entries of $R$. It will
be helpful to examine the term

\[
\| \Lambda R - R \Lambda \| = \| \tilde{\Lambda} R - ( \Delta \Lambda ) R - R \Lambda \| \leq \| \tilde{\Lambda} R - R \Lambda \| + \| \Delta \Lambda \| \leq \| \tilde{\Lambda} R - R \Lambda \| + \| \Delta A \| \tag{61}
\]

where $\Delta \Lambda = \tilde{\Lambda} - \Lambda$ and the last inequality follows from the fact that the variation in the
eigenvalues is upper bounded by the norm of the perturbation matrix.

We proceed by bounding the term $\| \tilde{\Lambda} R - R \Lambda \|$. First observe that

\[ ( \Delta A ) U = ( \tilde{A} - A ) U = ( \tilde{U} \tilde{\Lambda} \tilde{U}^T - U \Lambda U^T ) U = \tilde{U} \tilde{\Lambda} R - U \Lambda \]

which gives

\[ \| \Delta A \| = \| \tilde{U} \tilde{\Lambda} R - U \Lambda \| = \| \tilde{U}^T ( \tilde{U} \tilde{\Lambda} R - U \Lambda ) \| = \| \tilde{\Lambda} R - R \Lambda \| . \]

Using this in (61), we get

\[ \| \Lambda R - R \Lambda \| \leq 2 \| \Delta A \| . \tag{62} \]

Since each column of $R$ is given by the projection of a unit-norm vector on an orthogonal
basis, it is unit-norm. Denoting the $(i, j)$-th entry of $R$ by $R_{ij}$, we have

\[ | R_{jj} | = \Big( 1 - \sum_{i \neq j} R_{ij}^2 \Big)^{1/2} . \tag{63} \]

We proceed by bounding the sum of the entries $R_{ij}^2$. Notice that the $(i, j)$-th entry of
$\Lambda R - R \Lambda$ is given by $( \lambda_i - \lambda_j ) R_{ij}$. For each $j$,

\[ \sum_{i \neq j} ( \lambda_i - \lambda_j )^2 R_{ij}^2 \leq \| \Lambda R - R \Lambda \|^2 \leq 4 \| \Delta A \|^2 \]

where the first inequality follows from the fact that the norm of the $j$-th column of a matrix
can be upper bounded by its operator norm, and the second inequality is due to (62). Due
to the eigenvalue separation hypothesis, the first term above can be lower bounded as

\[ \sum_{i \neq j} ( \lambda_i - \lambda_j )^2 R_{ij}^2 \geq \eta^2 \sum_{i \neq j} R_{ij}^2 \]

which gives

\[ \sum_{i \neq j} R_{ij}^2 \leq \frac{ 4 \| \Delta A \|^2 }{ \eta^2 } . \]

From (63), we arrive at the stated result, i.e., for each $j$

\[ | \tilde{u}_j^T u_j | = | R_{jj} | = \Big( 1 - \sum_{i \neq j} R_{ij}^2 \Big)^{1/2} \geq \bigg( 1 - \frac{ 4 \| \Delta A \|^2 }{ \eta^2 } \bigg)^{1/2} . \]
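
The bound of Lemma 17 can be probed numerically by perturbing a symmetric matrix with well-separated eigenvalues and comparing the eigenvector alignments with the stated lower bound. The sketch below is an illustration only: the matrices are arbitrary test values, eigenvectors are paired through the ascending eigenvalue order returned by `numpy.linalg.eigh` (which assumes the perturbation is small enough not to reorder them), and the function name is hypothetical.

```python
import numpy as np


def eigvec_alignment_and_bound(A, dA):
    """Return (|u~_j^T u_j| for each j, lower bound sqrt(1 - 4||dA||^2 / eta^2))."""
    evals, U = np.linalg.eigh(A)            # ascending eigenvalues, columns = eigenvectors
    _, U_tilde = np.linalg.eigh(A + dA)
    eta = np.min(np.diff(evals))            # smallest gap between eigenvalues of A
    norm_dA = np.linalg.norm(dA, 2)         # spectral norm of the perturbation
    bound = np.sqrt(max(0.0, 1.0 - 4.0 * norm_dA**2 / eta**2))  # 0 if vacuous
    alignment = np.abs(np.sum(U * U_tilde, axis=0))
    return alignment, bound


# Arbitrary test matrices (not from the paper).
A = np.diag([0.0, 1.0, 2.0, 3.5])
dA = 0.02 * np.array(
    [[0.0, 1.0, 0.0, 1.0],
     [1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [1.0, 0.0, 1.0, 0.0]]
)
alignment, bound = eigvec_alignment_and_bound(A, dA)
```

The bound degrades quadratically in the ratio $\|\Delta A\| / \eta$, which is why the analysis of Theorem 16 needs the between-category edges to be weak relative to the eigenvalue gaps.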


Lemma 18 Let $U, \tilde{U} \in \mathbb{R}^{N \times N}$ be two orthogonal matrices such that the difference between
the corresponding columns $u_i$, $\tilde{u}_i$ of $U$ and $\tilde{U}$ is upper bounded as $\| u_i - \tilde{u}_i \|^2 \leq \delta$ for some
$\delta \leq 2$. Let $V = U^T$ and $\tilde{V} = \tilde{U}^T$. Then the difference between the corresponding columns
$v_i$, $\tilde{v}_i$ of $V$ and $\tilde{V}$ is upper bounded as

\[ \| v_i - \tilde{v}_i \|^2 \leq \delta + 2 \sqrt{N} \sqrt{ 1 - ( 1 - \delta/2 )^2 } . \]

Proof. Let $R = U^T \tilde{U}$. Since $u_i$ and $\tilde{u}_i$ are unit-norm vectors, we have

\[ \| u_i - \tilde{u}_i \|^2 = 2 - 2 u_i^T \tilde{u}_i \leq \delta, \]

therefore

\[ R_{ii} = u_i^T \tilde{u}_i \geq 1 - \frac{\delta}{2} \geq 0 \tag{64} \]

where $R_{ii}$ denotes the $i$-th diagonal entry of $R$. Since $\tilde{U} = U R$, we have $\tilde{v}_i = R^T v_i$, and it follows that

\[ v_i^T \tilde{v}_i = v_i^T R \, v_i = v_i^T R_d \, v_i + v_i^T R_{nd} \, v_i \geq v_i^T R_d \, v_i - | v_i^T R_{nd} \, v_i | \tag{65} \]

where $R_d$ and $R_{nd}$ denote the components of $R$ consisting respectively of the diagonal and
the nondiagonal terms. From the condition (64) on the diagonal entries of $R$, it follows that
the first term is lower bounded as

\[ v_i^T R_d \, v_i \geq 1 - \delta/2 . \]

Also, from (64), the $\ell_2$-norm of each row and each column of $R_{nd}$ is upper bounded by
$\sqrt{ 1 - ( 1 - \delta/2 )^2 }$. Bounding the operator norm of $R_{nd}$ in terms of the maximal $\ell_1$-norms of
the rows and columns, we get

\[ | v_i^T R_{nd} \, v_i | \leq \| R_{nd} \| \leq \sqrt{N} \sqrt{ 1 - ( 1 - \delta/2 )^2 } . \]

Using this together with the inequality (64) in (65), we get

\[ v_i^T \tilde{v}_i \geq 1 - \frac{\delta}{2} - \sqrt{N} \sqrt{ 1 - ( 1 - \delta/2 )^2 } \]

which gives the stated result

\[ \| v_i - \tilde{v}_i \|^2 = 2 - 2 v_i^T \tilde{v}_i \leq \delta + 2 \sqrt{N} \sqrt{ 1 - ( 1 - \delta/2 )^2 } . \]


We are now ready to prove Theorem 16.

Proof. We first look at the separation of the embedding obtained with $L_c$ for the reduced
data graph with all between-category edges removed. The data graph corresponding to
$L_c$ has $Q$ connected components; therefore, $L_c$ is a block diagonal matrix consisting of $Q$
blocks. Each $q$-th block is given by the objective matrix $L^{c,q} = L_w^q - \mu L_b^q$, where $L_w^q$ and $L_b^q$
are the within-class and the between-class Laplacian matrices of the data graph restricted
to only the category $q$. As $L_c$ is a block-diagonal matrix, its eigenvalues and eigenvectors
are given by the union of the eigenvalues and the eigenvectors of the block components $L^{c,q}$
(i.e., their inclusions in $\mathbb{R}^N$ by zero-padding).

Let $Y^q = [ y_1^q \, \dots \, y_{N_q}^q ]^T$ be the $d_q$-dimensional embedding of the $N_q$ samples in category $q$,
whose columns are the eigenvectors of $L^{c,q}$. The embedding $Y^q$ is assumed to be separable
with a margin of $\gamma_c$ by the theorem hypothesis. Consider the embeddings $Y^q$, $Y^r$ of two
different categories $q$ and $r$, and two classes $k$, $l$ respectively from these two categories. By
the separation hypothesis (30) within each category, there exist hyperplanes $\omega_k^q$ and $\omega_l^r$ with
$\| \omega_k^q \| = 1$, $\| \omega_l^r \| = 1$, such that for the embedding of any sample $y_i^q \in \mathbb{R}^{d_q}$ from class $k$ and
any sample $y_j^r \in \mathbb{R}^{d_r}$ from class $l$ it holds that

\[ ( \omega_k^q )^T y_i^q \geq \gamma_c / 2, \qquad ( \omega_l^r )^T y_j^r \leq - \gamma_c / 2 . \tag{66} \]

Now considering an ordering of all $Q$ categories, we can define the inclusion $\bar{y}_i^q \in \mathbb{R}^d$ of
each sample $y_i^q \in \mathbb{R}^{d_q}$ from each category $q$, where $d = \sum_q d_q$ and the nonzero entries of
$\bar{y}_i^q = [ 0 \dots 0 \ ( y_i^q )^T \ 0 \dots 0 ]^T$ are located at the support of category $q$. Note that each $( \bar{y}_i^q )^T$
corresponds to a row of the coordinate matrix $Y_c$, whose columns are the eigenvectors of
$L_c$.

Consider the hyperplane

\[ \omega_{k,l}^{q,r} = \frac{1}{\sqrt{2}} \big[ 0 \dots 0 \ ( \omega_k^q )^T \ 0 \dots 0 \ ( \omega_l^r )^T \ 0 \dots 0 \big]^T \in \mathbb{R}^d \]

with $\| \omega_{k,l}^{q,r} \| = 1$, formed by the inclusion of $( \omega_k^q )^T$ and $( \omega_l^r )^T$ in $\mathbb{R}^d$ over the entries
corresponding respectively to the categories $q$ and $r$. From (66), we get that the hyperplane $\omega_{k,l}^{q,r}$
separates these two classes as

\[ ( \omega_{k,l}^{q,r} )^T \bar{y}_i^q \geq \frac{\gamma_c}{2 \sqrt{2}}, \qquad ( \omega_{k,l}^{q,r} )^T \bar{y}_j^r \leq - \frac{\gamma_c}{2 \sqrt{2}} . \tag{67} \]

This shows that there exists a $d$-dimensional embedding given by the eigenvectors of $L_c$
that separates any pair of classes with a margin of at least $\gamma_c / \sqrt{2}$.

Now observe from Lemma 17 that the correlation between the $i$-th eigenvector $u_i^c$ of $L_c$
and the corresponding eigenvector $u_i$ of $L$ is lower bounded as

\[ | u_i^T u_i^c | \geq \xi . \]

This implies either of the conditions $u_i^T u_i^c \geq \xi$ or $u_i^T ( - u_i^c ) \geq \xi$. Therefore, the eigenvector
$u_i$ of the perturbed objective matrix $L$ has a correlation of at least $\xi$ with either $u_i^c$ or its
opposite $- u_i^c$. Meanwhile, the separability of an embedding is invariant to the negation
of one of the eigenvectors. This corresponds simply to changing the sign of one of the
coordinates of all data samples (i.e., reflecting the embedding with respect
to one axis); therefore, the linear separability remains the same. For this reason, it suffices
to treat the case $u_i^T u_i^c \geq \xi$ for analyzing the separability without loss of generality.

The condition $u_i^T u_i^c \geq \xi$ implies

\[ \| u_i - u_i^c \|^2 = 2 - 2 u_i^T u_i^c \leq 2 - 2 \xi . \tag{68} \]

While this upper bounds the difference between the corresponding eigenvectors of $L$ and $L_c$,
we need to upper bound the variation between the rows of the coordinate matrices $Y$ and $Y_c$, as we are interested
in the separation obtained with the embedded data coordinates given by the rows of $Y$.
Denoting the $i$-th rows of $Y$ and $Y_c$ respectively as $y_i^T$ and $( y_i^c )^T$, from the condition in (68)
and Lemma 18, the difference between the corresponding rows of these matrices can be
bounded as

\[ \| y_i - y_i^c \|^2 \leq 2 - 2 \xi + 2 \sqrt{ N ( 1 - \xi^2 ) } . \tag{69} \]

As the separability condition in (67) is general and valid for any two categories, we can
reformulate it as follows. For any pair of classes $k, l \in \{ 1, \dots, M \}$, there exists a hyperplane
$\omega_{k,l}$ such that

\[
\omega_{k,l}^T \, y_i^c \geq \frac{\gamma_c}{2 \sqrt{2}} \ \text{ if } C_i = k, \qquad
\omega_{k,l}^T \, y_i^c \leq - \frac{\gamma_c}{2 \sqrt{2}} \ \text{ if } C_i = l . \tag{70}
\]

Then, from (69) and (70) we have

\[
\omega_{k,l}^T \, y_i = \omega_{k,l}^T \, y_i^c + \omega_{k,l}^T ( y_i - y_i^c ) \geq \omega_{k,l}^T \, y_i^c - \| y_i - y_i^c \|
\geq \frac{\gamma_c}{2 \sqrt{2}} - \Big( 2 - 2 \xi + 2 \sqrt{ N ( 1 - \xi^2 ) } \Big)^{1/2}
\]

if $C_i = k$; and

\[
\omega_{k,l}^T \, y_i = \omega_{k,l}^T \, y_i^c + \omega_{k,l}^T ( y_i - y_i^c ) \leq \omega_{k,l}^T \, y_i^c + \| y_i - y_i^c \|
\leq - \frac{\gamma_c}{2 \sqrt{2}} + \Big( 2 - 2 \xi + 2 \sqrt{ N ( 1 - \xi^2 ) } \Big)^{1/2}
\]

if $C_i = l$. Hence, the embedding $Y$ given by the eigenvectors of the overall objective matrix
$L$ is linearly separable with a margin of

\[ \gamma = \frac{\gamma_c}{\sqrt{2}} - 2 \Big( 2 - 2 \xi + 2 \sqrt{ N ( 1 - \xi^2 ) } \Big)^{1/2} \]

if

\[ \gamma_c > 4 \Big( 1 - \xi + \sqrt{ N ( 1 - \xi^2 ) } \Big)^{1/2} . \]

We thus arrive at the result stated in the theorem. ■